Can't find error in my regex - python

I am trying to build a regex that can extract all Stack Overflow-like tags from a string. There is something wrong with my regex and I can't find what it is:
s = 'call,me r c++ c# 132(list) 2345sdf;sdf_sfg? "adf-sdf aso.net?'
re.findall(r"[^,\s;\"\(\)]*[a-zA-Z0-9_\+\-\.#]*[a-zA-Z0-9_\+\-#]", s.lower())
I am getting
['call',
'r',
'c++',
'c#',
'132',
'list',
'2345sdf',
'sdf_sfg',
'adf-sdf',
'aso.net']
So as you see the "me" after the comma is missing. I am also open to improvements on my regex.
EDIT: The patterns I want to match are valid SO tags, i.e. strings made of characters in the set [a-zA-Z0-9_+-.#]. The rest of my expression is a hack to exclude the dot at the end of the sentence and a workaround to eliminate the comma.

>>> s = 'call,me r c++ c# 132(list) 2345sdf;sdf_sfg? "adf-sdf aso.net? foo. bar.'
>>> re.findall(r'\b\w[\w#+.-]*(?<!\.)', s)
['call', 'me', 'r', 'c++', 'c#', '132', 'list', '2345sdf', 'sdf_sfg', 'adf-sdf', 'aso.net', 'foo', 'bar']
I require tags to start after a word boundary with a word character. After that, I also capture as many characters as possible that are either word characters or one of those I explicitly listed (#+.-). So if you want to support another character, just add it to the character class.
The negative lookbehind at the end prevents tags from ending with a dot.
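As a minimal sketch of how the pattern above could be reused, it can be compiled once and wrapped in a small helper. The name extract_tags is hypothetical (it does not come from the question), and lowercasing the input is carried over from the question's s.lower() call.

```python
import re

# Same pattern as in the answer: start at a word boundary with a word
# character, continue with word characters or '#', '+', '.', '-',
# and never end the match on a dot.
TAG_RE = re.compile(r'\b\w[\w#+.-]*(?<!\.)')

def extract_tags(text):
    """Return all tag-like tokens in text (hypothetical helper)."""
    return TAG_RE.findall(text.lower())

tags = extract_tags('C++ and C#, then ASP.NET. Done.')
print(tags)  # ['c++', 'and', 'c#', 'then', 'asp.net', 'done']
```

The greedy character class first swallows the trailing dot of "asp.net." and the lookbehind then forces the engine to back off by one character, which is why the sentence-ending dot is dropped while the inner dot is kept.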

Related

Why does the regular expression r'[a|(an)|(the)]+' detect 'h' and 'he' separately instead of 'the' as a whole?

I am trying to find 'a', 'an', 'the' in a given text. And the expression r'[a|(an)|(the)]+' recognizes only 'a' but not 'an' and 'the'.
nltk.re_show(r'[a|(an)|(the)]+', 'sdfkisdfjstdskfhdsklfjkhe an skfjkla')
This gives me the output
sdfkisdfjs{t}dskf{h}dsklfjk{h}{e} {a}{n} skfjkl{a}
I also tried
nltk.re_show(r'[a|<an>|<the>]+', 'sdfkisdfjstdskfhdsklfjkhe an skfjkla')
I get an output
sdfkisdfjs{t}dskf{h}dsklfjk{he} {an} skfjkl{a}
I don't understand why 'h' and 'he' are recognized.
What could be the right regular expression in this case to recognize 'a', 'an' and 'the' in a given text?
Square brackets and parentheses don't have the same meaning. Square brackets define a character class, meaning "any one of the characters inside".
Note also that if you want to match "an", you don't want the match to stop at "a", which means "an" has to come before "a" in the alternation.
What you want instead of
[a|(an)|(the)]+
seems to be
(an|a|the)+
or maybe just
(an|a|the)
or (less readable)
(an?|the)
(yes, there are often many regexes for one problem)
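A short sketch of both points, using plain re.findall instead of nltk.re_show (an assumption made only to keep the example free of third-party imports): the bracketed version is a character class, and the order of alternatives decides which one wins.

```python
import re

text = 'sdfkisdfjstdskfhdsklfjkhe an skfjkla'

# Character class: matches any run of the characters a | ( ) n t h e
as_class = re.findall(r'[a|(an)|(the)]+', text)
print(as_class)      # ['t', 'h', 'he', 'an', 'a']

# Alternation is tried left to right, so 'a' listed first shadows 'an'
a_first = re.findall(r'a|an|the', 'an the')
an_first = re.findall(r'an|a|the', 'an the')
print(a_first)       # ['a', 'the']
print(an_first)      # ['an', 'the']
```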
Regex: the|an|a
import re
text = 'sdfkisdfjstdskfhdsklfjkhe an skfjkla a dsda the dsathekoo'
array = re.findall(r'the|an|a', text)
print(array)
Output: ['an', 'a', 'a', 'a', 'the', 'a', 'the']
Although this is an old post, the following could be relevant to someone looking for an answer. My solution is
teststring='he was trying to snatch the token from a guy standing on an escalator in the mall'
re.findall(r'( the | a | an )', teststring)
[' the ', ' a ', ' an ', ' the ']
The leading and trailing spaces make each article a distinct sequence, which the search needs so that it can avoid, for example, the 'an' embedded inside the word 'standing'. You can strip the spaces from the results later for further processing.
Thanks
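One limitation of padding with spaces is that an article at the very start or end of the string has no space on one side. As a hedged alternative, not taken from the answer above, word boundaries (\b) express the same intent without consuming the spaces. The sample sentence is made up for illustration.

```python
import re

sentence = 'the cat sat on a mat'

# Space-padded version: misses 'the' because nothing precedes it
spaced = re.findall(r'( the | a | an )', sentence)
print(spaced)    # [' a ']

# Word-boundary version: no stripping needed, string edges are handled
bounded = re.findall(r'\b(?:the|an|a)\b', sentence)
print(bounded)   # ['the', 'a']
```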

Regex parsing text and get relevant words / characters

I want to parse a file, that contains some programming language. I want to get a list of all symbols etc.
I tried a few patterns and decided that this is the most successful yet:
pattern = r"\b(\w+|\W+)\b"
Using this on my text, that is something like:
string = "the quick brown(fox).jumps(over + the) = lazy[dog];"
re.findall(pattern, string)
will result in my required output, but I have some chars that I don't want and some unwanted formatting:
['the', ' ', 'quick', ' ', 'brown', '(', 'fox', ').', 'jumps', 'over',
' + ', 'the', ') = ', 'lazy', '[', 'dog']
My list contains some whitespace that I would like to get rid of and some double symbols, like (., that I would like to have as single chars. Of course I have to modify the \W+ to get this done, but I need a little help.
The other problem is that my regex doesn't match the ending ];, which I also need.
Why use \W+ for one or more, if you want single non-word characters in output? Additionally exclude whitespace by use of a negated class. Also it seems like you could drop the word boundaries.
re.findall(r"\w+|[^\w\s]", string)
This matches
\w+ one or more word characters
|[^\w\s] or one character, that is neither a word character nor a whitespace
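Applied to the sample string from the question, the suggested pattern behaves as sketched below. The variable name tokens is only illustrative.

```python
import re

string = "the quick brown(fox).jumps(over + the) = lazy[dog];"

# \w+      -> one or more word characters
# [^\w\s]  -> exactly one character that is neither word nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", string)
print(tokens)
# ['the', 'quick', 'brown', '(', 'fox', ')', '.', 'jumps', '(', 'over',
#  '+', 'the', ')', '=', 'lazy', '[', 'dog', ']', ';']
```

Whitespace never appears in the result, ")." is split into two tokens, and the trailing "];" is no longer lost because no word boundary is required after it.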

Regex: how to identify words in a screen (or how to exclude punctuation and numbers)

Can someone help me identify only the words in a text file? Upper or lower case, but no numbers, brackets, dashes, punctuation, etc. (whatever the definition of a "word" is).
I was thinking about:
r"\w+ \w+"
but it does not work
Thank you
You can use a character class that specifies the range of expected characters:
r'[a-zA-Z]+'
Read more here http://www.regular-expressions.info/charclass.html
And in python you can use the function re.findall() to return all the matches in a list or re.finditer which returns an iterator of match objects.
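A minimal sketch of the difference between the two functions, using a made-up sample string: re.findall gives the matched strings directly, while re.finditer yields match objects that also carry positions.

```python
import re

text = "hey there 222 how are you"
word_re = re.compile(r"[a-zA-Z]+")

# findall: a plain list of matched strings
words = word_re.findall(text)
print(words)   # ['hey', 'there', 'how', 'are', 'you']

# finditer: match objects, so start/end offsets are available too
spans = [(m.group(0), m.start(), m.end()) for m in word_re.finditer(text)]
print(spans[0])  # ('hey', 0, 3)
```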
re.findall(r"\b[a-z]+\b",test_str,re.I)
You can do it this way.
import re
text = "hey there 222 how are you ??? fine I hope!"
print(re.findall("[a-z]+", text, re.IGNORECASE))
#['hey', 'there', 'how', 'are', 'you', 'fine', 'I', 'hope']
Regex explanation
[a-z]+
Options: Case insensitive;
Match a single character in the range between “a” and “z” «[a-z]+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Python Live Demo
http://ideone.com/JT8ZjD

Python - parsing user input using a verbose regex

I am trying to design a regex that will parse user input in the form of full sentences. I am struggling to get my expression to fully work. I know it is not well coded, but I am trying hard to learn. I am currently trying to get it to parse a percentage as one string; see below the code.
My test "sentence" = How I'm 15.5% wholesome-looking U.S.A. we RADAR () [] {} you -- are, ... you?
text = input("please type somewhat coherently: ")
pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\w+(?:[-']\w+)* # permit word-internal hyphens and apostrophes
|[-.(]+ # double hyphen, ellipsis, and open parenthesis
|\S\w* # any sequence of word characters
# |[\d+(\.\d+)?%] # percentages, 82%
|[][\{\}.,;"'?():-_`] # these are separate tokens
'''
parsed = re.findall(pattern, text)
print(parsed)
My output = ['How', "I'm", '15', '.', '5', '%', 'wholesome-looking', 'U.S.A.', 'we', 'RADAR', '(', ')', '[', ']', '{', '}', 'you', '--', 'are', ',', '...', 'you', '?']
I am looking to have the '15', '.', '5', '%' parsed as '15.5%'. The line that is currently commented out is what should do it, but when uncommented it does absolutely nothing. I searched for resources to help, but they have not.
Thank you for your time.
If you just want the percentage to match as a whole entity, be aware that the regex engine analyzes the input string and the pattern from left to right. In an alternation, the leftmost alternative that matches at the current position is chosen, and the rest won't even be tested.
Thus, you need to pull the alternative \d+(?:\.\d+)? up, and the capturing group should be turned into a non-capturing or findall will yield strange results:
(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|\d+(?:\.\d+)?% # percentages, 82% <-- PULLED UP OVER HERE
|\w+(?:[-']\w+)* # permit word-internal hyphens and apostrophes
|[-.(]+ # double hyphen, ellipsis, and open parenthesis
|\S\w* # any sequence of word characters
|[][{}.,;"'?():_`-] # these are separate tokens
Also, please note I replaced [][\{\}.,;"'?():-_`] with [][{}.,;"'?():_`-]: braces do not have to be escaped, and - was forming an unnecessary range from a colon (decimal code 58) to an underscore (decimal 95), matching ;, <, =, >, ?, @, all the uppercase Latin letters, [, \, ] and ^.
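Both points can be checked with a short sketch. The tokenizer below is a shortened variant of the pattern above, not the full original: the last two alternatives are collapsed into a single \S for brevity, and the test sentence is made up.

```python
import re

# ':-_' inside a class is a range covering code points 58 through 95
in_range = re.findall(r'[:-_]', 'A@^')
print(in_range)   # ['A', '@', '^']

# With '-' moved to the end it is a literal hyphen
literal = re.findall(r'[:_-]', 'A@^:_-')
print(literal)    # [':', '_', '-']

# Shortened tokenizer: the percentage alternative comes before the word one
token_re = re.compile(r'''(?x)
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \d+(?:\.\d+)?%        # percentages, e.g. 15.5%
    | \w+(?:[-']\w+)*       # words with internal hyphens or apostrophes
    | [-.(]+                # double hyphen, ellipsis, open parenthesis
    | \S                    # any other single non-space character
''')
parsed = token_re.findall("It rose 15.5% in the U.S.A. -- wow")
print(parsed)     # ['It', 'rose', '15.5%', 'in', 'the', 'U.S.A.', '--', 'wow']
```

If the percentage line were placed after the \w+ alternative, "15" would be consumed as a word first and the percentage alternative would never get a chance at that position.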

Separating RegEx pattern matches that have the same potential starting characters

I would like to have a RegEx that matches several of the same character in a row, within a range of possible characters but does not return those pattern matches as one pattern. How can this be accomplished?
For clarification:
I want a pattern that starts with [a-c] and non-greedily returns any number of the same character, but not the other characters in the range. In the sequence 'aafaabbybcccc' it would find patterns for:
('aa', 'aa', 'bb', 'b', 'cccc')
but would exclude the following:
('f', 'aabb', 'y', 'bcccc')
I don't want to use multiple RegEx pattern searches because the order in which I find the patterns will determine the output of another function. This question is for the purposes of self study (python), not homework. (I'm also under 15 rep but will come back and upvote when I can.)
Good question. Use a regex like:
(?P<L>[a-c])(?P=L)+
This is more robust - you're not limited to a-c, you can replace it with a-z if you like. It first defines any character within a-c as L, then checks whether that character occurs again one or more times. Note that re.findall() returns only the captured group when the pattern contains one, so to get each whole run use re.finditer() and take group(0) of every match.
You can use the backreferences \1 to \9 to refer to whatever the 1st to 9th capturing group previously matched.
/([a-c])(\1+)/
[a-c]: Matches one of the characters a, b or c.
\1+ : Matches one or more further occurrences of the character just matched.
Perl:
perl -e '@m = "ccccbbb" =~ /([a-c])(\1+)/; print $m[0], $m[1]'
cccc
Python:
>>> import re
>>> [m.group(0) for m in re.finditer(r"([a-c])\1+", 'aafaabbybcccc')]
['aa', 'aa', 'bb', 'cccc']
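The result above lacks the single 'b' that the question lists as wanted, because \1+ requires at least one repetition. As a sketch of one possible adjustment (an assumption about what the asker needs, not part of the answers above), replacing + with * also accepts runs of length one:

```python
import re

s = 'aafaabbybcccc'

# findall returns the contents of the single group, not the whole run
groups_only = re.findall(r'([a-c])\1+', s)
print(groups_only)   # ['a', 'a', 'b', 'c']

# finditer + group(0) gives the whole run; '+' needs two or more characters
runs_plus = [m.group(0) for m in re.finditer(r'([a-c])\1+', s)]
print(runs_plus)     # ['aa', 'aa', 'bb', 'cccc']

# '*' allows zero repetitions, so the lone 'b' is kept
runs_star = [m.group(0) for m in re.finditer(r'([a-c])\1*', s)]
print(runs_star)     # ['aa', 'aa', 'bb', 'b', 'cccc']
```

Matches are still produced in left-to-right order by a single pass, which keeps the ordering requirement from the question intact.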
