python regex preserve specified special characters only [duplicate] - python

I've been looking for a way to isolate special characters with a regular expression, but I only seem to find the exact opposite of what I'm looking for. Basically, what I want is something along the lines of this:
import re
str = "I only want characters from the pattern below to appear in a list ()[]' including quotations"
pattern = """(){}[]"'-"""
result = re.findall(pattern, str)
What I expect from this is:
print(result)
#["(", ")", "[", "]", "'"]
Edit: thank you to whoever answered and then deleted their comment with this regex, which solved my problem:
pattern = r"""[(){}\[\]"'\-]"""

Why would you need regex for this when it can be done without regex?
>>> str = "I only want characters from the pattern below to appear in a list ()[]' including quotations"
>>> pattern = """(){}[]"'-"""
>>> [x for x in str if x in pattern]
['(', ')', '[', ']', "'"]

If it's for learning purposes (regex isn't really the best way here), then you can use:
import re
text = "I only want characters from the pattern below to appear in a list ()[]' including quotations"
output = re.findall('[' + re.escape("""(){}[]"'-""") + ']', text)
# ['(', ')', '[', ']', "'"]
Surrounding the characters with [ and ] makes them a regex character class, and re.escape escapes any characters that have special regex meaning so they don't break the pattern (e.g. ] terminating the class early, or - in a position where it would act as a character range).
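A minimal sketch of the same idea wrapped in a helper. The function name find_listed_chars and the sample text are made up for illustration; only re.escape and re.findall come from the answer above.

```python
import re

def find_listed_chars(text, chars):
    """Return every character of `text` that also appears in `chars`, in order."""
    # re.escape neutralises ], - and friends so they are literal inside [...]
    char_class = "[" + re.escape(chars) + "]"
    return re.findall(char_class, text)

sample = "a-b (c) [d] 'e'"
found = find_listed_chars(sample, """(){}[]"'-""")
print(found)
```

Note that an empty chars string would produce the invalid pattern [], so a real helper would probably want to guard against that case.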

Several of the characters in your set have special meaning in regular expressions; to match them literally, you need to backslash-escape them. Escaping alone, however, gives a pattern that only matches the whole sequence of characters in that exact order:
pattern = r"""\(\)\{\}\[\]"'-"""
To match any one of the characters, put them in a character class instead:
pattern = r"""[(){}\[\]"'\-]"""
Notice also the use of a "raw string" r'...' to avoid having Python interpret the backslashes.
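To see how an escaped sequence and a character class behave differently, here is a small comparison. The sample text and variable names are invented for this sketch.

```python
import re

text = "f(x) and (){}[]\"'- together"

# Escaped sequence: matches only the full run of characters, in this order.
sequence = re.compile(r"""\(\)\{\}\[\]"'-""")
# Character class: matches any single one of the listed characters.
any_one = re.compile(r"""[(){}\[\]"'\-]""")

seq_matches = sequence.findall(text)
class_matches = any_one.findall(text)
print(seq_matches)
print(class_matches)
```

The first pattern finds the run once as a single string, while the second returns each special character separately, which is what the question asks for.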

Related

What are the differences between these regular expressions: '^From .*#([^ ]*)' & '^From .*#(\S+)'? [duplicate]

I am learning regex in Python. At one point I came up with the first regex, while my tutorial gives the second. Both produce the same result for the given string. What are the differences? For what kind of string would the two produce different results?
>>> import re
>>> f = 'From m.rubayet94#gmail.com sat Jan'
>>> y = re.findall(r'^From .*#(\S+)', f); print(y)
['gmail.com']
>>> y = re.findall(r'^From .*#([^ ]*)', f); print(y)
['gmail.com']
[^ ]* means zero or more non-space characters.
\S+ means one or more non-whitespace characters.
It looks like you're aiming to match a domain name which may be part of an email address, so the second regex (\S+) is the better choice of the two: domain names can't contain any whitespace, which includes tabs (\t) and newlines (\n) as well as spaces. (Domain names can't contain various other characters either, but that's beside the point.)
Here are some examples of the differences:
import re
p1 = re.compile(r'^From .*#([^ ]*)')
p2 = re.compile(r'^From .*#(\S+)')
for s in ['From eric#domain\nTo john#domain', 'From graham#']:
    print(p1.findall(s), p2.findall(s))
In the first case, whitespace isn't handled properly: ['domain\nTo'] ['domain']
In the second case, you get a null match where you shouldn't: [''] []
One of the regexes uses [^ ]* while the other uses \S+. I assume that at that point you're trying to match anything but whitespace.
The difference between the two expressions is that \S matches any character that isn't whitespace (for str patterns the whitespace characters include [ \t\n\r\f\v]), whereas [^ ] matches any character that isn't a literal space (i.e. the character produced by pressing the spacebar), so it still matches tabs and newlines. The quantifiers differ as well: * allows an empty match, while + requires at least one character.
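A tab-separated line makes the difference easy to see. The address and variable names below are invented for this sketch.

```python
import re

# A tab (not a space) follows the domain in this made-up line.
line = "From alice#example.com\tsat Jan"

space_only = re.findall(r'^From .*#([^ ]*)', line)  # stops only at a literal space
any_ws = re.findall(r'^From .*#(\S+)', line)        # stops at any whitespace

print(space_only)
print(any_ws)
```

Here [^ ]* runs straight through the tab and picks up the next word, while \S+ stops at the tab.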

Regex replacement for strip()

Long time/first time.
I am a pharmacist by trade but am going through the motions of teaching myself how to code in a variety of languages that are useful to me for things like task automation at work, but mainly Python 3.x. I am working through the automatetheboringstuff eBook and finding it great.
I am trying to complete one of the practice questions from Chapter 7:
"Write a function that takes a string and does the same thing as the strip() string method. If no other arguments are passed other than the string to strip, then whitespace characters will be removed from the beginning and end of the string. Otherwise, the characters specified in the second argument to the function will be removed from the string."
I am stuck on the case where the characters I want to strip also appear inside the string I want to strip them from, e.g. 'ssstestsss'.strip('s')
#!python3
import re
respecchar = ['?', '*', '+', '{', '}', '.', '\\', '^', '$', '[', ']']
def regexstrip(string, _strip):
    if _strip == '' or _strip == ' ':
        _strip = r'\s'
    elif _strip in respecchar:
        _strip = r'\'+_strip'
    print(_strip) #just for troubleshooting
    re_strip = re.compile('^'+_strip+'*(.+)'+_strip+'*$')
    print(re_strip) #just for troubleshooting
    mstring = re_strip.search(string)
    print(mstring) #just for troubleshooting
    stripped = mstring.group(1)
    print(stripped)
As shown, running it on ('ssstestsss', 's') yields 'testsss', because the .+ grabs everything and the * lets the pattern ignore the final 'sss'. If I change the final * to a + it only improves a bit, yielding 'testss'. If I make the capture group non-greedy (i.e. (.+)? ) I still get 'testsss'. If I exclude the character to be stripped from the character class for the capture group and remove the end-of-string anchor (i.e. re.compile('^'+_strip+'*([^'+_strip+'.]+)'+_strip+'*')) I get 'te', and if I don't remove the end-of-string anchor then it obviously errors.
Apologies for the verbose and ramble-y question.
I deliberately included all the code (work in progress) as I am only learning so I realise that my code is probably rather inefficient, so if you can see any other areas where I can improve my code, please let me know. I know that there is no practical application for this code, but I'm going through this as a learning exercise.
I hope I have asked this question appropriately and haven't missed anything in my searches.
Regards
Lobsta
Your (.+) is greedy (by default). Just change it to non-greedy by using (.+?)
You can test Python regexes at this site
Edit: as someone commented, (.+?) and (.+)? do not do the same thing: (.+?) is the non-greedy version of (.+), while (.+)? makes the greedy (.+) group optional.
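The three variants can be compared side by side on the string from the question. This is only a sketch with the strip character hard-coded to s; the variable names are invented.

```python
import re

target = 'ssstestsss'

# Greedy group: .+ swallows the trailing s characters, s* then matches nothing.
greedy = re.search(r'^s*(.+)s*$', target).group(1)
# Non-greedy group: .+? takes as little as possible, leaving the tail to s*.
lazy = re.search(r'^s*(.+?)s*$', target).group(1)
# Optional greedy group: the ? applies to the group, so .+ is still greedy.
optional = re.search(r'^s*(.+)?s*$', target).group(1)

print(greedy, lazy, optional)
```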
As I mentioned in my comment, you did not include special chars into the character class.
Also, the .* without a re.S / re.DOTALL modifier does not match newlines. You may avoid using it with ^PATTERN|PATTERN$ or \APATTERN|PATTERN\Z (note that \A matches the start of a string, and \Z matches the very end of the string, $ can match before the final newline symbol in a string, and thus, you cannot use $).
I'd suggest shrinking your code to
import re
def regexstrip(string, _strip=None):
    _strip = r"\A[\s{0}]+|[\s{0}]+\Z".format(re.escape(_strip)) if _strip else r"\A\s+|\s+\Z"
    print(_strip) #just for troubleshooting
    return re.sub(_strip, '', string)
print(regexstrip(" ([no more stripping'] ) ", " ()[]'"))
# \A[\s\ \(\)\[\]\']+|[\s\ \(\)\[\]\']+\Z   (Python 3.7+ leaves the ' unescaped)
# no more stripping
print(regexstrip(" ([no more stripping'] ) "))
# \A\s+|\s+\Z
# ([no more stripping'] )
See the Python demo
Note that:
The _strip argument is optional with a =None
The _strip = r"\A[\s{0}]+|[\s{0}]+\Z".format(re.escape(_strip)) if _strip else r"\A\s+|\s+\Z" inits the regex pattern: if _strip is passed, the symbols are put inside a [...] character class and escaped (since we cannot control the symbol positions much, it is the quickest easiest way to make them all treated as literal symbols).
With re.sub, we remove the matched substrings.
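The function above always strips whitespace as well, even when characters are passed. If the goal is to mirror str.strip() more closely, a variant along these lines might work. It is a sketch under assumptions: the name regex_strip is made up, and \s is only assumed to be close to what str.strip() treats as whitespace, not guaranteed identical for every Unicode character.

```python
import re

def regex_strip(text, chars=None):
    """Rough regex equivalent of text.strip(chars)."""
    if chars is None:
        pattern = r"\A\s+|\s+\Z"      # default: strip whitespace only
    elif chars == "":
        return text                    # nothing to strip; "[]" would be invalid
    else:
        char_class = "[" + re.escape(chars) + "]"
        pattern = r"\A{0}+|{0}+\Z".format(char_class)
    return re.sub(pattern, "", text)

print(regex_strip("ssstestsss", "s"))
print(repr(regex_strip("  hi \n")))
```

Using re.sub with an alternation of a leading and a trailing run avoids the greedy capture group problem altogether.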

Split stacked entities using regex re.split in python

I am having trouble splitting continuous strings into more reasonable parts:
E.g. 'MarieMüller' should become 'Marie Müller'
So far I've used this, which works if no special characters occur:
' '.join([a for a in re.split(ur'([A-Z][a-z]+)', ''.join(entity)) if a])
This outputs for e.g. 'TinaTurner' -> 'Tina Turner', but doesn't work
for 'MarieMüller', which outputs: 'MarieMüller' -> 'Marie M \utf8 ller'
Now I came across using the regex \p{L}:
' '.join([a for a in re.split(ur'([\p{Lu}][\p{Ll}]+)', ''.join(entity)) if a])
But this produces weird things like:
'JenniferLawrence' -> 'Jennifer L awrence'
Could anyone give me a hand?
If you work with Unicode and need to use Unicode categories, you should consider using the PyPI regex module, which supports all the Unicode categories. The session below is Python 2 (note the u and ur string prefixes):
>>> import regex
>>> p = regex.compile(ur'(?<=\p{Ll})(?=\p{Lu})')
>>> test_str = u"Tina Turner\nMarieM\u00FCller\nJacek\u0104cki"
>>> result = p.sub(u" ", test_str)
>>> result
u'Tina Turner\nMarie M\xfcller\nJacek \u0104cki'
      ^             ^                ^
Here, the (?<=\p{Ll})(?=\p{Lu}) regex finds all locations between the lower- (\p{Ll}) and uppercase (\p{Lu}) letters, and then the regex.sub inserts a space there. Note that regex module automatically compiles the regex with regex.UNICODE flag if the pattern is a Unicode string (u-prefixed).
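For Python 3 without any third-party package, the same lowercase-then-uppercase boundary can be found with the standard unicodedata module instead of \p{Ll} and \p{Lu}. This is a swapped-in technique, not the regex module approach from the answer, and the function name is made up for illustration.

```python
import unicodedata

def split_on_case_boundary(text):
    """Insert a space wherever a lowercase letter (Ll) is followed by an uppercase one (Lu)."""
    pieces = []
    previous = ""
    for char in text:
        if (previous
                and unicodedata.category(previous) == "Ll"
                and unicodedata.category(char) == "Lu"):
            pieces.append(" ")
        pieces.append(char)
        previous = char
    return "".join(pieces)

print(split_on_case_boundary("MarieM\u00fcller"))
print(split_on_case_boundary("Jacek\u0104cki"))
```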
You can use re.sub() for this. It is much simpler, although it won't work for extended (non-ASCII) characters:
(?=(?!^)[A-Z])
For handling spaces
print(re.sub(r'(?<=[^\s])(?=(?!^)[A-Z])', ' ', ' Tina Turner'.strip()))
For handling cases of consecutive capital letters
print(re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', ' TinaTXYurner'.strip()))
Regex Breakdown
(?= #Lookahead to find all the position of capital letters
(?!^) #Ignore the first capital letter for substitution
[A-Z]
)
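A quick Python 3 comparison of the two substitutions from this answer, as a sketch with invented variable names. The last line illustrates the ASCII-only limitation: Ą is not in [A-Z], so no space is inserted.

```python
import re

word = 'TinaTXYurner'

# Space before every capital that follows a non-space character.
before_each_capital = re.sub(r'(?<=[^\s])(?=(?!^)[A-Z])', ' ', word)
# Space only where a lowercase ASCII letter is followed by a capital.
after_lowercase = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', word)
# Non-ASCII capital: the ASCII ranges do not see it.
non_ascii = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', 'Jacek\u0104cki')

print(before_each_capital)
print(after_lowercase)
print(non_ascii)
```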
Using a function built from Python's string methods instead of regular expressions, this should work:
def split_combined_words(combined):
    # Start with the first character (index 0), then walk the rest.
    separated = [combined[0]]
    for letter in combined[1:]:
        if letter.islower() or (letter.isupper() and separated[-1].isupper()):
            separated.append(letter)
        else:
            separated.extend((" ", letter))
    return "".join(separated)

Replace any number of white spaces with a single white space [duplicate]

Is there a way to use replace with a regex that denotes any amount of any whitespace (blanks but also tabs) and replace it with something? I am trying the following to collapse any run of whitespace into a single space, but it doesn't work:
mystring.replace('\s+', ' ')
You cannot use a regular expression in the replace() method for strings, you have to use the re module:
import re
mystring = re.sub(r'\s+', ' ', mystring)
Note the r prefix before the string literal, this makes sure that the backslash in your regular expressions is interpreted properly. It wouldn't actually make a difference here, but for different escape sequences it can cause serious problems. For example '\b' is a backspace character but r'\b' is a backslash followed by a 'b', which is used for matching word boundaries in regex.
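The \b case is easy to check directly. A small sketch, with sample strings invented for illustration:

```python
import re

backspace = '\b'   # one character: ASCII backspace (0x08)
raw = r'\b'        # two characters: a backslash and the letter b

words = "cat concat cat."
with_raw = re.findall(r'\bcat\b', words)    # word-boundary match
without_raw = re.findall('\bcat\b', words)  # looks for backspace + cat + backspace

collapsed = re.sub(r'\s+', ' ', "tab\there,\n  newline   and   spaces")

print(len(backspace), len(raw))
print(with_raw, without_raw)
print(collapsed)
```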
Try using re.sub:
import re
result = re.sub(r'\s+', ' ', mystring)
You can use str.split and str.join; to use a regex you need re.sub:
>>> ' '.join('f o o\t\t bar'.split())
'f o o bar'
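One difference between the two approaches is worth knowing: split/join also drops leading and trailing whitespace, while re.sub leaves a single space at each end. A short sketch with a made-up input:

```python
import re

padded = "  f o o\t\t bar \n"

joined = " ".join(padded.split())      # split() with no argument discards edge whitespace
subbed = re.sub(r"\s+", " ", padded)   # every run becomes one space, including the edges

print(repr(joined))
print(repr(subbed))
```

Calling .strip() on the re.sub result makes the two agree.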
Try something like this
import re
mystring = re.sub(r'\s+', ' ', mystring)
Just like this.
import re
print(re.sub(r'\s+', '_', 'hello there'))
# => hello_there

Match single quotes from python re

How do I match the following? I want all the names within the single quotes:
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How do I extract only the names within the single quotes?
name = re.compile(r'^\'+\w+\'')
The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
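The effect of the capture group is easiest to see by running the pattern with and without the parentheses on the sentence from the question (variable names are invented for this sketch):

```python
import re

s = ("This hasn't been much that much of a twist and turn's to "
     "'Tom','Harry' and u know who..yes its 'rock'")

names = re.findall(r"'(\w+)'", s)   # group: only the part inside the quotes
quoted = re.findall(r"'\w+'", s)    # no group: the whole match, quotes included

print(names)
print(quoted)
```

The apostrophes in hasn't and turn's are skipped because the word characters after them are followed by a space rather than a closing quote.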
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']
I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more word characters (letters, digits, underscore)
' # match a quote
\B # make sure that we're not at a word boundary
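A short check of what \B changes, using a made-up sentence that contains Rock'n'Roll:

```python
import re

text = "Rock'n'Roll and 'Tom'"

with_guard = re.findall(r"\B'\w+'\B", text)  # quotes must not touch a word character outside
without_guard = re.findall(r"'\w+'", text)   # also picks up the 'n' inside Rock'n'Roll

print(with_guard)
print(without_guard)
```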
^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve; this is just the minimal change that fixes your problem.
Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"
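One caveat with a pattern like '([^']*)' is that it pairs up every single quote it sees, so a stray apostrophe can shift the pairing. A sketch on a made-up sentence with one apostrophe:

```python
import re

text = "It isn't 'Tom' or 'Harry'"

anything_between = re.findall(r"'([^']*)'", text)  # pairs the apostrophe in isn't with the next quote
words_only = re.findall(r"'(\w+)'", text)          # requires word characters directly between quotes

print(anything_between)
print(words_only)
```

On the sentence in the question the two apostrophes happen to pair with each other, so the quoted names are still found, but an extra unwanted match appears before them.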
