This question already has an answer here:
Python regex bad character range.
(1 answer)
Closed 2 years ago.
I am comparing two strings but excluding the punctuation marks in both.
Here is my code snippet:
punctuation = r"[.?!,;:-']"
string1 = re.sub(punctuation, r"", string1)
string2 = re.sub(punctuation, r"", string2)
After running this code I get the following exception:
bad character range :-' at position 6
How to get rid of this exception? What's the meaning of "bad character range"?
- has a special meaning inside [] in a regular expression pattern: for example, [A-Z] matches the ASCII uppercase letters (from A to Z). If you need a literal -, you need to escape it, i.e.
punctuation = r"[.?!,;:\-']"
I also want to point out regex101.com, which is useful for testing regular expression patterns.
A - inside a character class [...] is used to denote a range of characters, for example: [0-9] would be equivalent to [0123456789].
Here, the :-' would mean any character between : and '. However, if you look up the character numbers, you see that they are in the wrong order for that to be a valid range:
>>> ord(":")
58
>>> ord("'")
39
In the opposite order '-: (inside the []) it would be a valid character range.
In any case, it is not what you want. You want the - to be interpreted as a literal - character.
There are two ways to achieve this. Either:
escape the - by writing \-
or put the - as the first or last character inside the [], e.g. r"[.?!,;:'-]"
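As a minimal sketch of both fixes (the sample sentence and variable names are made up for illustration), the following checks that the original class fails to compile while the two corrected classes behave identically:

```python
import re

# The original class fails to compile: ':' (58) sorts after "'" (39),
# so ":-'" is a descending, and therefore invalid, range.
try:
    re.compile(r"[.?!,;:-']")
    error_message = None
except re.error as exc:
    error_message = str(exc)

# Fix 1: escape the hyphen. Fix 2: move the hyphen to the end of the class.
escaped = re.compile(r"[.?!,;:\-']")
hyphen_last = re.compile(r"[.?!,;:'-]")

sample = "Well-known, isn't it?"  # illustrative input, not from the question
cleaned_escaped = escaped.sub("", sample)
cleaned_last = hyphen_last.sub("", sample)

print(error_message)
print(cleaned_escaped)  # Wellknown isnt it
```

Both corrected patterns strip the same characters, so the choice between them is mostly a matter of readability.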
I'm learning about regular expressions and I want to extract a string from a text that has the following characteristics:
It always begins with the letter C, in either lowercase or uppercase, which is then followed by a number of hexadecimal characters (meaning it can contain the letters A to F and numbers from 1 to 9, with no zeros included).
After those hexadecimal characters comes a letter P, also either in lowercase or uppercase.
And then some more hexadecimal characters (again, excluding 0).
Meaning I want to capture the strings that come in between the letters C and P as well as the string that comes after the letter P and concatenate them into a single string, while discarding the letters C and P
Examples of valid strings would be:
c45AFP2
CAPF
c56Bp26
CA6C22pAAA
For the above examples what I want would be to extract the following, in the same order:
45AF2 # Original string: c45AFP2
AF # Original string: CAPF
56B26 # Original string: c56Bp26
A6C22AAA # Original string: CA6C22pAAA
Examples of invalid strings would be:
BCA6C22pAAA # It doesn't begin with C
c56Bp # There aren't any characters after P
c45AF0P2 # Contains a zero
I'm using python and I want a regex to extract the two strings that come both in between the characters C and P as well as after P
So far I've come up with this:
(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*
A breakdown would be:
(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
But with the above regex I can't match any of the strings!
When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.
Meaning the below regex works:
(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*
I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?
But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?
I still need it to:
Not be a match if the string contains the number 0
Only be a match if ALL conditions are met
Thank you
To match both groups before and after P or p
(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
(?<=^[Cc]) - Positive Lookbehind. Must match a case insensitive C or c at the start of the line
[1-9a-fA-F]+ - Matches hexadecimal characters one or more times
(?=[Pp] - Positive Lookahead for case insensitive p or P
([1-9a-fA-F]+$) - Capture group for one or more hexadecimal characters following the p or P
Your main problem is you're using a look behind (?<=[pP]) for something ahead, which will never work: You need a look ahead (?=...).
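A small sketch of the difference, using a made-up test string: once the hex characters have been consumed, a lookbehind inspects the character just consumed (always a hex character), while a lookahead inspects the character that comes next.

```python
import re

text = "45AFP2"  # illustrative input

# Lookbehind after the hex run: the previous character is always a hex
# character, never P, so this can never match.
with_lookbehind = re.search(r"[a-fA-F1-9]+(?<=[pP])", text)

# Lookahead after the hex run: checks that the NEXT character is P.
with_lookahead = re.search(r"[a-fA-F1-9]+(?=[pP])", text)

print(with_lookbehind)         # None
print(with_lookahead.group())  # 45AF
```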
Also, the final quantifier should be + not * because you require at least one trailing character after the p.
The final mistake is that you're not capturing anything, you're only matching, so put what you want to capture inside brackets, which also means you can remove all look arounds.
If you use the case insensitive flag, it makes the regex much smaller and easier to read.
A working regex that captures the 2 hex parts in groups 1 and 2 is:
(?i)^c([a-f1-9]*)p([a-f1-9]+)
Unless you need to use \A, prefer ^ over \A: ^ is easier to read, and in multi-line mode ^ matches at the start of every line, which is what many situations and tools expect, whereas \A only ever matches at the very start of the input. I've used ^.
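Here is one way the capture-group approach could be wrapped up in Python. This is only a sketch: the helper name extract_hex is hypothetical, and it assumes the whole string must match, so it uses fullmatch rather than relying on ^ alone.

```python
import re

# Assumption: the entire string must be valid, hence fullmatch below.
HEX_PAIR = re.compile(r"c([a-f1-9]*)p([a-f1-9]+)", re.IGNORECASE)

def extract_hex(text):
    """Return the two hex parts joined together, or None if text is invalid."""
    match = HEX_PAIR.fullmatch(text)
    if match is None:
        return None
    return match.group(1) + match.group(2)

candidates = ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA",
              "BCA6C22pAAA", "c56Bp", "c45AF0P2"]
for candidate in candidates:
    print(candidate, "->", extract_hex(candidate))
```

Because the first group uses *, a string such as cp1 would also be accepted; change it to + if at least one character is required between C and P.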
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 1 year ago.
What does this pattern (?<=\w)\W+(?=\w) mean in a Python regular expression?
#l is a list
print(re.sub("(?<=\w)\W+(?=\w)", " ", l))
Here's a breakdown of the elements:
\w means a word character: a letter, a digit, or an underscore
\W+ is the opposite of \w; with the + it means one or more non-word characters
?<= is called a "lookbehind assertion"
?= is a "lookahead assertion"
So this re.sub statement means "if there are one or more non-alphanumeric characters with an alphanumeric character before and after, replace the non-alphanumeric character(s) with a space".
And by the way, the third argument to re.sub must be a string (or bytes-like object); it can't be a list.
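To see the effect of the two lookarounds, here is a short sketch with made-up inputs. Runs of non-word characters are replaced only when a word character sits on both sides, so leading and trailing punctuation is left alone, and an underscore counts as a word character.

```python
import re

pattern = re.compile(r"(?<=\w)\W+(?=\w)")

samples = ["!!hello,,, world??", "snake_case", "a - b"]  # illustrative inputs
results = [pattern.sub(" ", sample) for sample in samples]

print(results)  # ['!!hello world??', 'snake_case', 'a b']
```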
Just put it into a site like regex101.com and hover the cursor over the parts.
https://regex101.com/r/JtrWIw/1
It would match non-word chars between word chars. Bits between the last 'd' of 'word' and the first 'w' of 'word' from the string below as an example...
word^&*((*&^%$%^&*& ^%$£%^&**&^%$£!"£$%^&*()word
Example:
import re
#if it is a list...
l = ['John Smith', 'This%^&*(string', 'Never!£$Mind^&*I$?/Solved{}][]It']
#l is a list
print(re.sub(r"(?<=\w)\W+(?=\w)", " ", l[2]))
Never Mind I Solved It
How do I write a regular expression that can handle the following substitution scenario?
Hello, this is a ne-
w line of text wher-
e we are trying hyp-
henation.
I have a short piece of Python code that breaks a long one-line string into a multi-line string and produces output similar to the sample above.
I want a regular expression that takes care of a single hyphenated character, like in the first and second lines, and just pulls that character up onto the previous line.
something like re.sub("-\n<any character>","<the any character>\n")
I cannot find a way to handle the hyphenated character.
Below is some further information about the question:
Word = "Python string comparison is performed using the characters in both strings. The characters in both strings are compared one by one."
def hyphenate(word, x):
    for i in range(x, len(word), x):
        word = word[:i] + ("\n" if (word[i] == " " or word[i-1] == " " ) else "-\n") + (word[i:] if word[i] != " " else word[(i+1):])
    return(word)
print(hyphenate(Word, 20))
#Produced output
Python string compar-
ison is performed
using the character- <=
s in both strings.
The characters in b- <=
oth strings are co-
mpared one by one.
#Desired output
Python string compar-
ison is performed
using the characters <=
in both strings.
The characters in <=
both strings are co-
mpared one by one.
You don't need to include the trailing character at all (text below stands for your input string, which re.sub needs as its third argument).
re.sub(r'-\n', '', text)
If for some reason you do need to capture the character, you can use r'\1' to refer back to it.
re.sub(r'-\n([aeiou])', r'\1', text)
The notation r'...' produces a "raw string" where backslashes only represent themselves. In Python, backslashes in strings are otherwise processed as escapes: for example, '\n' represents the single character newline, whereas r'\n' represents the two literal characters backslash and n (which in a regex matches a literal newline).
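For the desired output shown in the question, where a lone character is moved across the break rather than the lines being joined, one possible sketch uses two substitutions with a capture group. The function name fix_single_letter_breaks is made up, and the input is the produced output copied from the question.

```python
import re

# Output of hyphenate(Word, 20), copied from the question.
wrapped = (
    "Python string compar-\n"
    "ison is performed\n"
    "using the character-\n"
    "s in both strings.\n"
    "The characters in b-\n"
    "oth strings are co-\n"
    "mpared one by one."
)

def fix_single_letter_breaks(text):
    # A lone character after the break moves up to the previous line.
    text = re.sub(r"-\n(\w)(?!\w) ?", r"\1\n", text)
    # A lone character before the break moves down to the next line.
    text = re.sub(r" (\w)-\n", r"\n\1", text)
    return text

fixed = fix_single_letter_breaks(wrapped)
print(fixed)
```

Note that this can push a line one character past the wrap width, so it is a cosmetic post-processing step rather than a replacement for a proper wrapping routine.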
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
Suppose a string:
s = 'F3·Compute·Introduction to Methematical Thinking.pdf'
I substitute F3·Compute· with '' using regex
In [23]: re.sub(r'F3?Compute?', '',s)
Out[23]: 'F3·Compute·Introduction to Methematical Thinking.pdf'
It failed to work as I intended.
When I tried the literal string instead,
In [21]: re.sub(r'F3·Compute·', '', 'F3·Compute·Introduction to Methematical Thinking.pdf')
Out[21]: 'Introduction to Methematical Thinking.pdf'
What's the problem with my regex pattern?
The question mark ? does not stand in for a single character in regular expressions. It means 0 or 1 of the previous character, which in your case was 3 and e. Instead, the . is what you're looking for. It is a wildcard that stands for a single character (and has nothing to do with your middle-dot character; that is just coincidence).
re.sub(r'F3.Compute.', '',s)
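A quick sketch of what ? and . actually do here, assuming Python 3 (where str is Unicode); the test strings other than s are made up for illustration:

```python
import re

s = 'F3·Compute·Introduction to Methematical Thinking.pdf'

# '?' makes the preceding character optional: the "3" and the final "e".
optional_matches = [
    bool(re.fullmatch(r'F3?Compute?', candidate))
    for candidate in ('F3Compute', 'FComput', 'F3·Compute·')
]

# '.' matches any single character (except a newline by default).
stripped = re.sub(r'F3.Compute.', '', s)

print(optional_matches)  # [True, True, False]
print(stripped)          # Introduction to Methematical Thinking.pdf
```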
Use dot to match any single character:
#coding: utf-8
import re
s = 'F3·Compute·Introduction to Methematical Thinking.pdf'
output = re.sub(r'F3.Compute.', '', unicode(s,"utf-8"), flags=re.U)
print output
Your original pattern, F3?Compute?, did not have the desired effect. It says to match F followed optionally by the digit 3, and it also makes the final e of Compute optional. In any case, you were not accounting for the separator characters.
Note also that in Python 2 we must match against the unicode version of the string rather than the byte string directly. Without this, the middle dot is stored as more than one byte, so a single . will not match the separator you are trying to target. In Python 3, str is already Unicode and the unicode() call is neither needed nor available.
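For Python 3, a stricter variant is sketched below. It matches the literal middle dot instead of any character and anchors the prefix at the start of the string; using re.escape and the ^ anchor are choices made for this example, not part of the answer above.

```python
import re

s = 'F3·Compute·Introduction to Methematical Thinking.pdf'

# In Python 3, "·" (U+00B7) is a single character in a str,
# so no decoding step is needed before matching.
prefix = re.escape('F3·Compute·')  # treat the prefix as literal text
output = re.sub('^' + prefix, '', s)

print(output)  # Introduction to Methematical Thinking.pdf
```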
NOTE: This post is not the same as the post "Re.sub not working for me".
That post is about matching and replacing ANY non-alphanumeric substring in a string.
This question is specifically about matching and replacing non-alphanumeric substrings that explicitly show up at the beginning of a string.
The following method attempts to match any non-alphanumeric character string "AT THE BEGINNING" of a string and replace it with a new string "BEGINNING_"
def m_getWebSafeString(self, dirtyAttributeName):
    cleanAttributeName = ''.join(dirtyAttributeName)
    # Deal with beginning of string...
    cleanAttributeName = re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
    # Deal with end of string...
    if "BEGINNING_" in cleanAttributeName:
        print ' ** ** ** D: "{}" ** ** ** C: "{}"'.format(dirtyAttributeName, cleanAttributeName)
PROBLEM DESCRIPTION: The method not only replaces non-alphanumeric characters, it also incorrectly inserts the "BEGINNING_" string at the beginning of all strings that are passed into it. In other words...
GOOD RESULT: If the method is passed the string *##$ThisIsMyString1, it correctly returns BEGINNING_ThisIsMyString1
BAD/UNWANTED RESULT: However, if the method is passed the string ThisIsMyString2, it incorrectly (and always) inserts the replacement string (BEGINNING_), even when there are no non-alphanumeric characters, and yields the result BEGINNING_ThisIsMyString2
MY QUESTION: What is the correct way to write the re.sub() line so that it only replaces the non-alphanumeric characters at the beginning of the string, and does not always insert the replacement string at the beginning of the original input string?
You're matching 0 or more instances of non-alphabetic characters by using the * quantifier, which means it'll always be picked up by your pattern. You can replace what you have with
re.sub('^[^a-zA-Z]+', ...)
to ensure that only 1 or more instances are matched.
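The difference between the two quantifiers can be checked with a small sketch. The helper mark_leading_junk and its quantifier parameter are hypothetical, introduced only to run both variants side by side:

```python
import re

def mark_leading_junk(name, quantifier):
    # quantifier is "*" (the original) or "+" (the fix).
    pattern = '^[^a-zA-Z]' + quantifier
    return re.sub(pattern, 'BEGINNING_', name)

# "*" also matches the empty string at position 0, so it always inserts.
star_clean = mark_leading_junk('ThisIsMyString2', '*')
# "+" needs at least one non-letter, so a clean string is left untouched.
plus_clean = mark_leading_junk('ThisIsMyString2', '+')
plus_dirty = mark_leading_junk('*##$ThisIsMyString1', '+')

print(star_clean)  # BEGINNING_ThisIsMyString2
print(plus_clean)  # ThisIsMyString2
print(plus_dirty)  # BEGINNING_ThisIsMyString1
```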
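Replace
re.sub('^[^a-zA-z]*',"BEGINNING_",cleanAttributeName)
with
re.sub('^[^a-zA-Z]+',"BEGINNING_",cleanAttributeName)
The + requires at least one matching character. The range is also corrected from A-z to A-Z, because A-z additionally covers the characters [ \ ] ^ _ and `, which sit between Z and a in ASCII.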
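One more detail about the character class in the question: the range A-z is wider than it looks. The sketch below (with a made-up input, _private) lists the extra characters it covers and compares it with A-Z:

```python
import re

# Characters that sit between 'Z' (90) and 'a' (97) in ASCII.
between = ''.join(chr(code) for code in range(ord('Z') + 1, ord('a')))
print(between)  # [\]^_`

# With A-z the underscore counts as a "letter", so nothing is replaced.
with_typo = re.sub(r'^[^a-zA-z]+', 'BEGINNING_', '_private')
# With A-Z the leading underscore is treated as a non-letter.
with_fix = re.sub(r'^[^a-zA-Z]+', 'BEGINNING_', '_private')

print(with_typo)  # _private
print(with_fix)   # BEGINNING_private
```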
There is a more elegant solution. You can use this
re.sub(r'^\W+', 'BEGINNING_', cleanAttributeName)
\W matches any non-word character; for ASCII-only matching this is equivalent to the class [^a-zA-Z0-9_]. Note that, unlike [^a-zA-Z], it does not match digits or underscores.
>>> re.sub(r'^\W+', 'BEGINNING_', '##$ThisIsMyString1')
'BEGINNING_ThisIsMyString1'
>>> re.sub(r'^\W+', 'BEGINNING_', 'ThisIsMyString2')
'ThisIsMyString2'
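Whether \W is a drop-in replacement depends on what should count as junk. The following sketch, with illustrative inputs, compares it against the letters-only class and shows the effect of the re.ASCII flag on a non-ASCII letter:

```python
import re

names = ['123abc', '_name', '##$ThisIsMyString1']  # illustrative inputs

# \W leaves leading digits and underscores alone (they are word characters).
non_word = [re.sub(r'^\W+', 'BEGINNING_', name) for name in names]
# [^a-zA-Z] strips everything that is not an ASCII letter.
non_letter = [re.sub(r'^[^a-zA-Z]+', 'BEGINNING_', name) for name in names]

# For str patterns, \W is Unicode-aware unless re.ASCII is given.
unicode_mode = re.sub(r'^\W+', 'X', 'éa')
ascii_mode = re.sub(r'^\W+', 'X', 'éa', flags=re.ASCII)

print(non_word)
print(non_letter)
print(unicode_mode, ascii_mode)
```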