How to match the following i want all the names with in the single quotes
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How to extract the name within the single quotes only
name = re.compile(r'^\'+\w+\'')
The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']
I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve, this is the minimal change to fix your problem.
Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"
Related
So I have been trying to construct a regex that can detect the pattern {word}{.,#}{word} and seperate it into [word,',' (or '.','#'), word].
But i am not able to create one that does strict matching for this pattern and ignores everything else.
I used the following regex
r"[\w]+|[.]"
this one is doing well , but it doesnt do strict matching, as in if (,, # or .) characters dont occur in text, it will still give me words, which i dont want.
I would like to have a regex which strictly matches the above pattern and gives me the splits(using re.findall) and if not returns the whole word as it is.
Please Note: word on either side of the {,.#} , both words are not strictly to be present but atleast one should be present
Some example text for reference:
no.16 would give me ['no','.','16']
#400 would give me ['#,'400']
word1.word2 would give me ['word1','.','word2']
Looking forward to some help and assistance from all regex gurus out there
EDIT:
I forgot to add this. #viktor's version works as needed with only one problem, It ignores ALL other words during re.findall
eg. ONE TWO THREE #400 with the viktor's regex gives me ['','#','400']
but what was expected was ['ONE','TWO','THREE','#',400]
this can be done with NLTK or spacy, but use of those is a limitation.
I suggest using
(\w+)?([.,#])((?(1)\w*|\w+))
See the regex demo.
Details
(\w+)? - An optional group #1: one or more word chars
([.,#]) - Group #2: ., , or #
((?(1)\w*|\w+)) - Group #3: if Group 1 matched, match zero or more word chars (the word is optional on the right side then), else, match one or more word chars (there must be a word on the right side of the punctuation chars since there is no word before them).
See the Python demo:
import re
pattern = re.compile(r'(\w+)?([.,#])((?(1)\w*|\w+))')
strings = ['no.16', '#400', 'word1.word2', 'word', '123']
for s in strings:
print(s, ' -> ', pattern.findall(s))
Output:
no.16 -> [('no', '.', '16')]
#400 -> [('', '#', '400')]
word1.word2 -> [('word1', '.', 'word2')]
word -> []
123 -> []
The answer to your edit is
if re.search(r'\w[.,#]|[.,#]\w', text):
print( re.findall(r'[.,#]|[^\s.,#]+', text) )
If there is a word char, then any of the three punctuation symbols, and then a word char again in the input string, you can find and extract all occurrences of the [.,#]|[^\s.,#]+ pattern, namely a ., , or #, or one or more occurrences of any one or more chars other than whitespace, ., , and #.
I hope this code will solve your problem if you want to split the string by any of the mentioned special characters:
a='no.16'
b='#400'
c='word1.word2'
lst=[a, b, c]
for elem in lst:
result= re.split('(\.|#|,)',elem)
while('' in result):
result.remove('')
print(result)
You could do something like this:
import re
str = "no.16"
pattern = re.compile(r"(\w+)([.|#])(\w+)")
result = list(filter(None, pattern.split(str)))
The list(filter(...)) part is needed to remove the empty strings that split returns (see Python - re.split: extra empty strings that the beginning and end list).
However, this will only work if your string only contains these two words separated by one of the delimiters specified by you. If there is additional content before or after the pattern, this will also be returned by split.
for string "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']", I want to find "#..'...'" like "#id~'objectnavigator-card-list'" or "#class~'outbound-alert-settings'". But when I use regex ((#.+)\~(\'.*?\')), it find "#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings'". So how to modify the regex to find the string successfully?
Use non-capturing, non greedy, modifiers on the inner brackets and search for not the terminating character, e.g.:
re.findall(r"((?:#[^\~]+)\~(?:\'[^\]]*?\'))", test)
On your test string returns:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
Limit the characters you want to match between the quotes to not match the quote:
>>> re.findall(r'#[a-z]+~\'[-a-z]*\'', x)
I find it's much easier to look for only the characters I know are going to be in a matching section rather than omitting characters from more permissive matches.
For your current test string's input you can try this pattern:
import re
a = "//div[#id~'objectnavigator-card-list']//li[#class~'outbound-alert-settings']"
# find everything which begins by '#' and neglect ']'
regex = re.compile(r'(#[^\]]+)')
strings = re.findall(regex, a)
# Or simply:
# strings = re.findall('(#[^\\]]+)', a)
print(strings)
Output:
["#id~'objectnavigator-card-list'", "#class~'outbound-alert-settings'"]
I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print re.sub(x, '_', s)
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since the ^ can only be used inside brackets [], this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since there is no "not" in REs except ^ inside [ ] which only matches one character (as you found). Here is my solution:
import re
def subit(m):
stuff, word = m.groups()
return ("_" * len(stuff)) + word
s = 'I am going home now, thank you.' # string to modify
print re.sub(r'(.+?)(going|you|$)', subit, s)
Gives:
_____going_________________you_
To explain. The RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parentheses group (the brackets). That is followed by either "going" or "you" or the end-of-line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that when you are dealing with words and you want to exclude them or etc. you need to remember that most of the regex engines (most of them use traditional NFA) analyze the strings by characters. And here since you want to exclude two word and want to use a negative lookahead you need to define the allowed strings as words (using word boundary) and since in sub it replaces the matched patterns with it's replace string you can't just pass the _ because in that case it will replace a part like I am with 3 underscore (I, ' ', 'am' ). So you can use a function to pass as the second argument of sub and multiply the _ with length of matched string to be replace.
I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in it but not at the beginning and not at the end of the word
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
What this means is find a alphanumeric character at least once as indicated by the utilization of '+', then find a '-', following by another alphanumeric character at least once, again, as indicated by the '+' again.
str is a built in name, better not to use it for naming
st = 'word semi-column peace'
# \w+ word - \w+ word after -
print(re.findall(r"\b\w+-\w+\b",st))
['semi-column']
a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to prevent a match from words with hyphens at the beggining or end. A string like "-not-wanted-" will match both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by either a hyphen nor a word character.
After: (?![-\w]) not followed by either a hyphen nor a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
regex101 demo
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
ideone demo
You can also use the following regex:
>>> st = "word semi-column peace"
>>> print re.findall(r"\S+\-\S+", st)
['semi-column']
You can try something like this: Centering on the hyphen, I match until there is a white space in either direction from the hyphen I also make check to see if the words are surrounded by hyphens (e.g -test-cats-) and if they are I make sure not to include them. The regular expression should also work with findall.
st = 'word semi-column peace'
m = re.search(r'([^ | ^-]+-[^ | ^-]+)', st)
if m:
print m.group(1)
I'm having trouble matching a string with regexp (I'm not that experienced with regexp). I have a string which contains a forward slash after each word and a tag. An example:
led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION
In those strings, I am only interested in all strings that precede /PERSON. Here's the regexp pattern that I came up with:
(\w)*\/PERSON
And my code:
match = re.findall(r'(\w)*\/PERSON', string)
Basically, I am matching any word that comes before /PERSON. The output:
>>> reg
['Timothy', '', 'Geithner']
My problem is that the second match, matched to an empty string as for R./PERSON, the dot is not a word character. I changed my regexp to:
match = re.findall(r'(\w|.*?)\/PERSON', string)
But the match now is:
['led/O by/O Timothy', ' R.', ' Geithner']
It is taking everything prior to the first /PERSON which includes led/O by/O instead of just matching Timothy. Could someone please help me on how to do this matching, while including a full stop as an abbreviation? Or at least, not have an empty string match?
Thanks,
Match everything but a space character ([^ ]*). You also need the star (*) inside the capture:
match = re.findall(r'([^ ]*)\/PERSON', string)
Firstly, (\w|.) matches "a word character, or any character" (dot matches any character which is why you're getting those spaces).
Escaping this with a backslash will do the trick: (\w|\.)
Second, as #Ionut Hulub points out you may want to use + instead of * to ensure you match something but Regular Expressions work on the principle of "leftmost, longest" so it'll always try to match the longest part that it can before the slash.
If you want to match any non-whitespace character you can use \S instead of (\w|\.), which may actually be what you want.