Python - parsing user input using a verbose regex - python

I am trying to design a regex that will parse user input in the form of full sentences. I am struggling to get my expression to fully work. I know it is not well coded, but I am trying hard to learn. I am currently trying to get it to parse a percentage as one string; see below the code.
My test "sentence" = How I'm 15.5% wholesome-looking U.S.A. we RADAR () [] {} you -- are, ... you?
import re
text = input("please type somewhat coherently: ")
pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\w+(?:[-']\w+)* # permit word-internal hyphens and apostrophes
|[-.(]+ # double hyphen, ellipsis, and open parenthesis
|\S\w* # any sequence of word characters
# |[\d+(\.\d+)?%] # percentages, 82%
|[][\{\}.,;"'?():-_`] # these are separate tokens
'''
parsed = re.findall(pattern, text)
print(parsed)
My output = ['How', "I'm", '15', '.', '5', '%', 'wholesome-looking', 'U.S.A.', 'we', 'RADAR', '(', ')', '[', ']', '{', '}', 'you', '--', 'are', ',', '...', 'you', '?']
I am looking to have the '15', '.', '5', '%' parsed as '15.5%'. The line that is currently commented out is what should do it, but when uncommented it does absolutely nothing. I searched for resources to help, but found none that did.
Thank you for your time.

If you just want the percentage to match as a whole entity, be aware that the regex engine analyzes the input string and the pattern from left to right. In an alternation, the leftmost alternative that matches at the current position is chosen, and the remaining alternatives are not even tested.
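The effect of alternative order can be seen in a small experiment. This is only a minimal sketch with a made-up input string and illustrative variable names, not part of the original pattern:

```python
import re

text = "82% sure"

# Word alternative first: \w+ consumes "82" before \d+% is ever tried.
word_first = re.findall(r"\w+|\d+%", text)

# Percentage alternative first: "82%" is matched as a single token.
percent_first = re.findall(r"\d+%|\w+", text)

print(word_first)     # ['82', 'sure']
print(percent_first)  # ['82%', 'sure']
```

In the first call the stray % is simply skipped, because no alternative can match at that position.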
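Thus, you need to pull the percentage alternative \d+(?:\.\d+)?% up above the \w+ alternative. It must not be wrapped in [...], which would turn it into a character class, and the capturing group should be turned into a non-capturing one, or findall will return the group contents instead of the whole match: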
(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?% # percentages, 82% <-- PULLED UP OVER HERE
|\w+(?:[-']\w+)* # permit word-internal hyphens and apostrophes
|[-.(]+ # double hyphen, ellipsis, and open parenthesis
|\S\w* # any sequence of word characters
|[][{}.,;"'?():_`-] # these are separate tokens
See regex demo.
Also, please note I replaced [][\{\}.,;"'?():-_`] with [][{}.,;"'?():_`-]: braces do not have to be escaped inside a character class, and :-_ was forming an unintended range from a colon (decimal code 58) to an underscore (decimal code 95), which also matches ;, <, =, >, ?, @, all the uppercase Latin letters, [, \, ] and ^.
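To check the reordered pattern, it can be run against the test sentence from the question. The following is a minimal sketch: the names TOKEN_PATTERN, sentence, tokens and range_hits are chosen here for illustration, and re.VERBOSE is used in place of the inline (?x) flag.

```python
import re

# Same alternatives, in the same order, as the corrected pattern above.
TOKEN_PATTERN = re.compile(r'''
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \d+(?:\.\d+)?%          # percentages, e.g. 82% or 15.5%
    | \w+(?:[-']\w+)*         # words with internal hyphens and apostrophes
    | [-.(]+                  # double hyphen, ellipsis, open parenthesis
    | \S\w*                   # a non-space character followed by word characters
    | [][{}.,;"'?():_`-]      # separate punctuation tokens
''', re.VERBOSE)

sentence = ("How I'm 15.5% wholesome-looking U.S.A. we RADAR "
            "() [] {} you -- are, ... you?")
tokens = TOKEN_PATTERN.findall(sentence)
print(tokens)

# The accidental range :-_ covers code points 58..95, so it matches
# 'A', '@' and '^', but neither '#' (35) nor 'a' (97).
range_hits = re.findall(r"[:-_]", "A@^#a")
print(range_hits)  # ['A', '@', '^']
```

With the percentage alternative moved up, '15.5%' comes out as a single token and the rest of the output is unchanged.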

Related

Escaping regex unicode string in Python

I have a user defined string.
I want to use it in a regex with a small improvement: match any of three apostrophe characters instead of only one.
For example,
APOSTROPHES = re.escape('\'\u2019\u02bc')
word = re.escape("п'ять")
word = ''.join([s if s not in APOSTROPHES else '[%s]' % APOSTROPHES for s in word])
It works well for Latin text, but for Unicode the list comprehension gives the following string:
"[\\'\\\\u2019\\\\u02bc]\xd0[\\'\\\\u2019\\\\u02bc]\xbf[\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc][\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8f[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x82[\\'\\\\u2019\\\\u02bc]\xd1[\\'\\\\u2019\\\\u02bc]\x8c"
It looks like it finds the backslashes in both strings and then substitutes APOSTROPHES for them.
Also, print(list(w for w in APOSTROPHES)) gives ['\\', "'", '\\', '\\', 'u', '2', '0', '1', '9', '\\', '\\', 'u', '0', '2', 'b', 'c'].
How can I avoid it? I want to get "\п[\'\u2019\u02bc]\я\т\ь"
As I understand it, you want to create a regular expression that matches a given word with any kind of apostrophe.
A regex that matches any apostrophe can be defined as a character class:
APOSTROPHES_REGEX = r'[\'\u2019\u02bc]'
For instance, you have this (Ukrainian?) word which contains a single quote:
word = "п'ять"
EDIT: If your word contains another kind of apostrophe, you can normalize it first, like this:
import re
word = re.sub(APOSTROPHES_REGEX, "'", word)
To build the regex, escape the word, because it may contain characters that are special in a regex, such as punctuation. Before Python 3.7, re.escape also turned the single quote "'" into r"\'"; since Python 3.7 it leaves the quote untouched, so replacing r"\'" in the escaped string is not reliable.
It is safer to split the word on the apostrophe, escape each part, and join the parts with the apostrophe regex:
word_regex = APOSTROPHES_REGEX.join(re.escape(part) for part in word.split("'"))
The new RegEx can then be used to match the same word with any apostrophe:
assert re.match(word_regex, "п'ять") # '
assert re.match(word_regex, "п’ять") # \u2019
assert re.match(word_regex, "пʼять") # \u02bc
Note: in Python 3, str patterns are Unicode-aware by default, so the re.UNICODE flag is redundant. In Python 2, pass re.UNICODE so that character classes like r"\w" also match non-ASCII letters.
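The same idea can be wrapped in a small helper that accepts a word written with any of the three apostrophes. This is a minimal sketch for Python 3; the names APOSTROPHE_CLASS and apostrophe_tolerant_pattern are hypothetical and not taken from the answer above.

```python
import re

APOSTROPHES = "'\u2019\u02bc"
# Character class matching any one of the three apostrophes.
APOSTROPHE_CLASS = "[" + re.escape(APOSTROPHES) + "]"

def apostrophe_tolerant_pattern(word):
    """Return a regex matching `word` with any apostrophe variant."""
    # Split on whichever apostrophe the word uses, escape the pieces so that
    # regex metacharacters stay literal, then rejoin with the class.
    parts = re.split(APOSTROPHE_CLASS, word)
    return APOSTROPHE_CLASS.join(re.escape(part) for part in parts)

pattern = apostrophe_tolerant_pattern("п\u2019ять")
print(bool(re.fullmatch(pattern, "п'ять")))        # True
print(bool(re.fullmatch(pattern, "п\u02bcять")))   # True
print(bool(re.fullmatch(pattern, "пять")))         # False
```

Escaping the pieces separately keeps the character class itself out of reach of re.escape, which is what went wrong in the question.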

Regex replacement for strip()

Long time/first time.
I am a pharmacist by trade but am going through the motions of teaching myself how to code in a variety of languages that are useful to me for things like task automation at work, mainly Python 3.x. I am working through the automatetheboringstuff eBook and finding it great.
I am trying to complete one of the practice questions from Chapter 7:
"Write a function that takes a string and does the same thing as the strip() string method. If no other arguments are passed other than the string to strip, then whitespace characters will be removed from the beginning and end of the string. Otherwise, the characters specified in the second argument to the function will be removed from the string."
I am stuck on the case where the characters I want to strip also appear inside the string I want to strip them from, e.g. 'ssstestsss'.strip('s').
#!python3
import re
respecchar = ['?', '*', '+', '{', '}', '.', '\\', '^', '$', '[', ']']
def regexstrip(string, _strip):
    if _strip == '' or _strip == ' ':
        _strip = r'\s'
    elif _strip in respecchar:
        _strip = '\\' + _strip  # escape regex special characters
    print(_strip) #just for troubleshooting
    re_strip = re.compile('^'+_strip+'*(.+)'+_strip+'*$')
    print(re_strip) #just for troubleshooting
    mstring = re_strip.search(string)
    print(mstring) #just for troubleshooting
    stripped = mstring.group(1)
    print(stripped)
As shown, running it on ('ssstestsss', 's') yields 'testsss', because the .+ grabs everything and the * lets the pattern ignore the final 'sss'. If I change the final * to a +, it only improves a bit, yielding 'testss'. If I make the capture group non-greedy, i.e. (.+)?, I still get 'testsss'. If I exclude the character to be stripped from the character class for the capture group and remove the end-of-string anchor, i.e. re.compile('^'+_strip+'*([^'+_strip+'.]+)'+_strip+'*'), I get 'te', and if I don't remove the end-of-string anchor, it obviously errors.
Apologies for the verbose and ramble-y question.
I deliberately included all the code (work in progress) as I am only learning so I realise that my code is probably rather inefficient, so if you can see any other areas where I can improve my code, please let me know. I know that there is no practical application for this code, but I'm going through this as a learning exercise.
I hope I have asked this question appropriately and haven't missed anything in my searches.
Regards
Lobsta
Your (.+) is greedy by default. Just change it to non-greedy by using (.+?).
You can test python regex at this site
Edit: As someone commented, (.+?) and (.+)? do not do the same thing: (.+?) is the non-greedy version of (.+), while (.+)? makes the greedy group (.+) optional.
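The difference between the three forms can be checked directly on the string from the question. This is a minimal sketch with the strip character hard-coded as s; the variable names are only illustrative:

```python
import re

s = "ssstestsss"

# Greedy group: .+ runs to the end, the trailing s* matches nothing.
greedy = re.search(r"^s*(.+)s*$", s).group(1)

# Optional greedy group: still greedy, it is merely allowed to be absent.
optional = re.search(r"^s*(.+)?s*$", s).group(1)

# Lazy group: .+? grows one character at a time until s*$ can match.
lazy = re.search(r"^s*(.+?)s*$", s).group(1)

print(greedy)    # testsss
print(optional)  # testsss
print(lazy)      # test
```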
As I mentioned in my comment, you did not include special chars into the character class.
Also, the . (as in .+) does not match newlines without the re.S / re.DOTALL modifier. You can avoid it altogether with ^PATTERN|PATTERN$ or \APATTERN|PATTERN\Z. Note that \A matches the start of the string and \Z matches the very end of the string, whereas $ can also match before a final newline in the string, so $ is not suitable here.
I'd suggest shrinking your code to
import re
def regexstrip(string, _strip=None):
    _strip = r"\A[\s{0}]+|[\s{0}]+\Z".format(re.escape(_strip)) if _strip else r"\A\s+|\s+\Z"
    print(_strip) #just for troubleshooting
    return re.sub(_strip, '', string)
print(regexstrip(" ([no more stripping'] ) ", " ()[]'"))
# \A[\s\ \(\)\[\]']+|[\s\ \(\)\[\]']+\Z   (Python 3.7+; older versions also escape the ')
# no more stripping
print(regexstrip(" ([no more stripping'] ) "))
# \A\s+|\s+\Z
# ([no more stripping'] )
See the Python demo
Note that:
The _strip argument is optional with a =None
The _strip = r"\A[\s{0}]+|[\s{0}]+\Z".format(re.escape(_strip)) if _strip else r"\A\s+|\s+\Z" line builds the regex pattern: if _strip is passed, its characters are escaped and put inside a [...] character class together with \s (since we cannot control the order of the characters much, escaping them all is the quickest and easiest way to have them treated as literals); otherwise only whitespace is stripped.
With re.sub, we remove the matched substrings.
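One difference from the built-in is worth noting: the pattern above always includes \s in the class, so whitespace is stripped even when specific characters are given, whereas str.strip(chars) removes only the listed characters. The following is a minimal sketch of a variant that mirrors str.strip more closely; the name regex_strip and the test cases are illustrative assumptions, not part of the original answer.

```python
import re

def regex_strip(string, chars=None):
    """Strip `chars` (or whitespace if not given) from both ends of `string`."""
    # Build a character class from the escaped characters, or fall back to \s.
    char_class = "[" + re.escape(chars) + "]" if chars else r"\s"
    # \A and \Z anchor to the absolute start and end of the string.
    pattern = r"\A{0}+|{0}+\Z".format(char_class)
    return re.sub(pattern, "", string)

cases = [("ssstestsss", "s"), ("  padded  ", None), ("[a]", "[]"), ("^-a-^", "^-")]
for text, chars in cases:
    # Print the regex result next to the built-in result for comparison.
    print(repr(regex_strip(text, chars)), repr(text.strip(chars)))
```

For these inputs both columns should agree, which is a cheap way to sanity-check the exercise solution against the method it imitates.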

Regex parsing text and get relevant words / characters

I want to parse a file that contains source code in some programming language. I want to get a list of all symbols etc.
I tried a few patterns and decided that this is the most successful yet:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, that is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will result in my required output, but I have some chars that I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', '(', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of and some double symbols, like )., that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other problem is that my regex doesn't match the ending ];, which I also need.
Why use \W+ for one or more if you want single non-word characters in the output? Additionally, exclude whitespace by using a negated class. It also seems you could drop the word boundaries.
re.findall(r"\w+|[^\w\s]", string)
This matches
\w+ one or more word characters
|[^\w\s] or one character, that is neither a word character nor a whitespace
See Ideone demo
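A quick check of that pattern against the sample line from the question (a minimal sketch; the variable names are only illustrative):

```python
import re

string = "the quick brown(fox).jumps(over + the) = lazy[dog];"

# \w+ keeps identifiers whole; [^\w\s] emits every other
# non-whitespace character as a token of its own.
tokens = re.findall(r"\w+|[^\w\s]", string)
print(tokens)
# ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(', 'over',
#  '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']
```

Whitespace never appears in the result, and the trailing ]; is split into two tokens.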

Regex to ignore specific characters

I am splitting a text on non-alphanumeric characters and would like to exclude specific characters like apostrophes, dashes/hyphens and commas.
I would like to build a regex for the following cases:
non-alphanumeric character, excluding apostrophes and hyphens
non-alphanumeric character, excluding commas, apostrophes and hyphens
This is what I have tried:
import re
def split_text(text):
    my_text = re.split(r'\W', text)
    # the following doesn't work.
    #my_text = re.split('([A-Z]\w*)',text)
    #my_text = re.split("^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$",text)
    return my_text
Case 1:
Sample Input: What's up? It's good to see you my-friend. "Hello" to-the world!.
Sample Output: ["What's",'up',"It's",'good','to','see','you','my-friend','Hello','to-the','world']
Case 2:
Sample Input: It means that, it's not good-to do such things.
Sample Output: ['It', 'means', 'that,', "it's", 'not', 'good-to', 'do', 'such', 'things']
Any ideas?
Is this what you want?
non-alphanumeric character, excluding apostrophes and hyphens
my_text = re.split(r"[^\w'-]+",text)
non-alphanumeric character, excluding commas, apostrophes and hyphens
my_text = re.split(r"[^\w',-]+",text)
The [] syntax defines a character class; [^..] "complements" it, i.e. negates it. Keep the hyphen at the very end of the class so that it is treated as a literal rather than as a range operator.
See the documentation about that:
Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
You can use a negated character class for this:
my_text = re.split(r"[^\w'-]+",text)
or
my_text = re.split(r"[^\w,'-]+",text) # commas are also kept inside the tokens
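One practical detail: re.split returns an empty string when the text starts or ends with a delimiter, as both sample inputs do. A minimal sketch that filters those out follows; the keep_commas parameter is a hypothetical addition for switching between the two cases.

```python
import re

def split_text(text, keep_commas=False):
    # Characters kept inside tokens: word characters, apostrophes, hyphens,
    # and optionally commas. The hyphen stays last so it is a literal.
    pattern = r"[^\w,'-]+" if keep_commas else r"[^\w'-]+"
    # Drop the empty strings produced by leading or trailing delimiters.
    return [token for token in re.split(pattern, text) if token]

case1 = split_text('What\'s up? It\'s good to see you my-friend. "Hello" to-the world!.')
case2 = split_text("It means that, it's not good-to do such things.", keep_commas=True)

print(case1)
print(case2)
```

Both results should match the sample outputs given in the question.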

Can't find error in my regex

I am trying to build a regex that is able to extract all Stack Overflow-like tags from a string. There is something wrong with my regex and I can't find what it is:
s = 'call,me r c++ c# 132(list) 2345sdf;sdf_sfg? "adf-sdf aso.net?'
re.findall(r"[^,\s;\"\(\)]*[a-zA-Z0-9_\+\-\.#]*[a-zA-Z0-9_\+\-#]", s.lower())
I am getting
['call',
'r',
'c++',
'c#',
'132',
'list',
'2345sdf',
'sdf_sfg',
'adf-sdf',
'aso.net']
So as you can see, the "me" after the comma is missing. I am also open to improvements on my regex.
EDIT: The pattern I want to match is valid SO tags, i.e. all characters in the set [a-zA-Z0-9_+-.#]. The rest of my expression is a hack to exclude the dot at the end of the sentence and a workaround to eliminate the comma.
>>> s = 'call,me r c++ c# 132(list) 2345sdf;sdf_sfg? "adf-sdf aso.net? foo. bar.'
>>> re.findall(r'\b\w[\w#+.-]*(?<!\.)', s)
['call', 'me', 'r', 'c++', 'c#', '132', 'list', '2345sdf', 'sdf_sfg', 'adf-sdf', 'aso.net', 'foo', 'bar']
I require tags to start after a word boundary with a word character. After that, I also capture as many word characters or characters from those I explicitly listed (#+.-) as possible. So if you want to support another character, just add it to the character class.
The negative lookbehind at the end prevents tags from ending with a dot.
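The effect of the lookbehind is easiest to see by running the pattern with and without it. This is a minimal sketch on a shortened, made-up input string:

```python
import re

s = "aso.net? foo. bar. c++."

# Without the lookbehind, a sentence-ending dot is swallowed into the tag.
without_lookbehind = re.findall(r"\b\w[\w#+.-]*", s)

# With it, the engine backtracks one character whenever the match ends in a dot.
with_lookbehind = re.findall(r"\b\w[\w#+.-]*(?<!\.)", s)

print(without_lookbehind)  # ['aso.net', 'foo.', 'bar.', 'c++.']
print(with_lookbehind)     # ['aso.net', 'foo', 'bar', 'c++']
```

Dots inside a tag, as in aso.net, are unaffected because the lookbehind only inspects the last matched character.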
