Restricting re.findall for number of words in quotation

I'm trying to get only the quotation out of a sentence - but! only if it's one or two words long. So for the sentence
mysentence = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'
The output should be
lesson
never try
So far I've got
import re
print(re.findall(r'"(.*?)"', mysentence))
Any suggestions how to solve this?

You can try this regex:
"[^"\s]+(?:\s[^"\s]+)?"
The " at the start and end matches the quotes beginning and ending the quoted word or phrase. Between them we first match one word: [^"\s]+. [^"\s] is any character that is not a quote or whitespace. Whitespace is excluded to make sure that this part matches only a single word.
The next part is all in an optional group, because the second word is optional. The second word is a whitespace character followed by a single word: \s[^"\s]+.
Demo
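A quick way to check this pattern is to run it through re.findall. The sketch below reuses the sentence from the question. The capture group in the second pattern is an addition for illustration (it is not part of the answer above), so that findall returns the text without the surrounding quotes.

```python
import re

mysentence = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'

# Pattern from the answer: the whole match includes the quotes.
with_quotes = re.findall(r'"[^"\s]+(?:\s[^"\s]+)?"', mysentence)
print(with_quotes)  # ['"lesson"', '"never try."']

# Assumed variant: a capture group makes findall return only the quoted text.
without_quotes = re.findall(r'"([^"\s]+(?:\s[^"\s]+)?)"', mysentence)
print(without_quotes)  # ['lesson', 'never try.']
```

Because [^"\s] accepts any character that is not a quote or whitespace, trailing punctuation such as the final dot stays in the match.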

You may use
"[^"\s\w]*(\w+(?:\s+\w+)?)[^"\s\w]*"
See the regex demo.
Details
" - a " char
[^"\s\w]* - 0+ non-word and non-whitespace chars other than "
(\w+(?:\s+\w+)?) - Group 1:
\w+ - 1+ word chars
(?:\s+\w+)? - an optional sequence of 1+ whitespace chars followed with 1+ word chars
[^"\s\w]* - 0+ non-word and non-whitespace chars other than "
" - a " char
Python demo:
import re
rx = r'"[^"\s\w]*(\w+(?:\s+\w+)?)[^"\s\w]*"'
s = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'
print(re.findall(rx, s))
# => ['lesson', 'never try']

Try this:
"((?:\w+[ .]*){1,2})"
To allow a different maximum number of words, change the 2 to the number you need.
See the demo.
" - a " char
((?:\w+[ .]*){1,2}) - Group 1:
(?:\w+[ .]*) - non-capturing group
\w+ - sequence of 1+ 'word' chars
[ .]* - 0+ word delimiter chars, in our case spaces and dots
{1,2} - repeat the non-capturing group one or two times
" - a " char
As a variant, the word separator can be described as "0+ chars that are neither a word char nor a " char", that is, [^"\w]*
For example:
"((?:\w+[^"\w]*){1,2})"
See the demo
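To illustrate how the word limit can be changed, here is a minimal sketch that builds the pattern from a parameter. The helper name quoted_phrases and its max_words argument are hypothetical, introduced only for this example. It uses the [^"\w]* separator described above.

```python
import re

def quoted_phrases(text, max_words=2):
    """Return quoted phrases that contain between 1 and max_words words."""
    # [^"\w]* is the separator: 0+ chars that are neither
    # word chars nor a double quote.
    pattern = r'"((?:\w+[^"\w]*){1,%d})"' % max_words
    return re.findall(pattern, text)

mysentence = 'Kids, you "tried your best" and you failed miserably. The "lesson" is, "never try."'

print(quoted_phrases(mysentence))     # ['lesson', 'never try.']
print(quoted_phrases(mysentence, 3))  # ['tried your best', 'lesson', 'never try.']
```

Note that the separator is inside the capture group, so a trailing dot before the closing quote is kept in the result.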

Related

How to include hyphenated words in this regex? [duplicate]

I have the following regex for matching words:
\w+(?:'|\-\w+)?
For the following string:
' 's yea' don't -yeah no- ice-cream '
it gives the following matches:
s yea' don't yeah no ice-cream
However, I would like the following matches:
's yea' don't yeah no ice-cream
Since a word can start or end with an apostrophe but not with a hyphen. Note that a ' on its own should not be matched.

Your \w+(?:'|\-\w+)? starts matching with a word character \w, thus all "words" starting with ' are not matched as per the requirements.
In general, you can match words with and without hyphens with
\w+(?:-\w+)*
In the current scenario, you may include the \w and ' into a character class and use
'?\w[\w']*(?:-\w+)*'?
See the regex demo
If a "word" can only have 1 hyphen, replace * at the end with the ? quantifier.
Breakdown:
'? - optional apostrophe
\w - a word character
[\w']* - 0+ word character or an apostrophe
(?:-\w+)* - 0+ sequences of:
- - a hyphen
\w+ - 1+ word character
'? - optional apostrophe
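As a rough check of the breakdown above, the pattern can be run against the string from the question. This sketch assumes the outer apostrophes shown in the question are part of the input string.

```python
import re

# Input from the question, including the stand-alone apostrophes.
text = "' 's yea' don't -yeah no- ice-cream '"

word_re = re.compile(r"'?\w[\w']*(?:-\w+)*'?")
words = word_re.findall(text)
print(words)  # ["'s", "yea'", "don't", 'yeah', 'no', 'ice-cream']
```

The lone apostrophes are skipped because the pattern always requires at least one word character.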

regex - dont want to tokenize certain part of the input

I have to tokenize a string on every non-word character and on "_". But two words, "START_CALL" and "END_CALL", contain "_" and must not be split.
So far I came up with :
split_tokens = re.split(r'([\W_])', string_to_be_replaced)
But it splits every token on the underscore (_), so "START_CALL" becomes "START", "_", "CALL".
I could split on "START_CALL" first and then tokenize the remaining substrings.
But I would be interested to know whether there is a more elegant way of doing this.

You can use either of the following:
(\W|_\b|\b_)
([\W_])(?<!\B_\B)
See the regex demo #1 / regex demo #2. Details:
( - start of a capturing group
\W| - any non-word char (a char other than letter, digit, some connector punctuation and most diacritic chars), or
_\b| - an underscore that is not followed with a word char, or
\b_ - an underscore that is not preceded with a word char
) - end of the group.
[\W_](?<!\B_\B) - any non-word char or _ that is not a _ both preceded and followed with word chars.
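The two patterns can be compared side by side with re.split. This is only a sketch: the sample string and the tokenize helper are made up for illustration, and empty or whitespace-only pieces are filtered out to keep the output readable.

```python
import re

text = "START_CALL a-b _x y_ END_CALL"

def tokenize(pattern, s):
    # re.split keeps the captured delimiters; drop empty and
    # whitespace-only pieces from the result.
    return [tok for tok in re.split(pattern, s) if tok and not tok.isspace()]

tokens_alternation = tokenize(r'(\W|_\b|\b_)', text)
tokens_lookbehind = tokenize(r'([\W_])(?<!\B_\B)', text)

print(tokens_alternation)
# ['START_CALL', 'a', '-', 'b', '_', 'x', 'y', '_', 'END_CALL']
print(tokens_alternation == tokens_lookbehind)  # True
```

An underscore is split off only when it sits at the edge of a word, so the underscores inside START_CALL and END_CALL are left alone.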

A word starting with t but ends with other than e

I am trying to create a regex that matches a word that starts with t or T and doesn't end with the letter e. I tried the code below, but it's not giving me the desired result. Could anyone show me what exactly is missing here?
my_str = my_file.read()
word = re.findall("[tT].*[^e]$", my_str)
print(word)

You can use either of the following:
\bt(?:[a-z]*[a-df-z])?\b
\bt[a-z]*\b(?<!e)
Just for completeness, here is a regex to match any word starting with a Cyrillic т and not ending with a Cyrillic е:
\bт[^\W\d_]*\b(?<!е)
See the regex demo #1, regex demo #2 and a Cyrillic regex demo.
If you need a case insensitive matching, add re.I:
re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I)
And a note on word boundaries: if the words can be glued to _ or digits, use letter boundaries rather than word boundaries:
r'(?<![a-z])t(?:[a-z]*[a-df-z])?(?![a-z])'
r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)' # Unicode letter boundaries
Regex details
\b - word boundary (start of string or a position immediately after a char other than a digit, letter, underscore)
(?<![a-z]) ((?<![^\W\d_]) is a Unicode aware equivalent) - a negative lookbehind that matches a location that is not immediately preceded with a letter
t - a t letter
(?:[a-z]*[a-df-z])? - an optional non-capturing group matching 0 or more letters and then a letter other than e
\b - word boundary
(?![a-z]) ((?![^\W\d_]) is a Unicode aware equivalent) - a negative lookahead that matches a location that is not immediately followed with a letter.
Also,
\bt[a-z]*\b(?<!e) matches a word boundary, t, any zero or more lowercase ASCII letters (any ASCII letters with re.I), then a word boundary marks the end of a word and the negative lookbehind (?<!e) fails the match if there is e at the end of the word
[^\W\d_]* - matches zero or more Unicode letters.
See a Python demo:
import re
text = r't, train => main,teene!'
cyr_text = r'таня тане работе'
print( re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bt[a-z]*\b(?<!e)', text, re.I) )
# => ['t', 'train']
print( re.findall(r'\bт[^\W\d_]*\b(?<!е)', cyr_text, re.I) )
# => ['таня']
print( re.findall(r'(?<![^\W\d_])т[^\W\d_]*(?![^\W\d_])(?<!е)', cyr_text, re.I) )
# => ['таня']
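The difference between word boundaries and letter boundaries mentioned above is easier to see on words glued to _ or digits. The sample text below is made up for this sketch.

```python
import re

text = 'x_train 2two tree'

# \b treats "_" and digits as word chars, so "train" and "two" are not
# seen as separate words here.
word_bounded = re.findall(r'\bt(?:[a-z]*[a-df-z])?\b', text, re.I)

# Letter boundaries only look at ASCII letters on either side.
letter_bounded = re.findall(r'(?<![a-z])t(?:[a-z]*[a-df-z])?(?![a-z])', text, re.I)

print(word_bounded)    # []
print(letter_bounded)  # ['train', 'two']
```

In both cases "tree" is rejected because it ends with e.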

There is also another way of doing it:
re.findall(r"\b[Tt]+[a-zA-Z]*[^Ee\s]\b", my_str)

Maybe:
[\W]([Tt]\w*[^e])[\W]
A non-word character, followed by the captured word (T or t, zero or more word characters, then a character other than e), followed by a non-word character.

Split String into alpha & punctuation with exceptions regex

I am trying to split a string into 2 parts: alphanumeric chars and special chars. I want to limit the occurrence of the escape character
b.sc.... = ['b.sc.','...'] (Preserve "." inside word & outside word just once)
really???? = ['really','????'] (split when any other special char encountered)
I went through a lot of SO questions before posting here. I have come up with this so far: re.findall(r"[\w+|\-.+\w]+|\W+", text)
How to proceed further?

You can use
[re.sub(r'([.-])+', r'\1', x) for x in re.findall(r'\w+(?:-+\w+)+|\w+(?:\.+\w+)*\.?|[^\w\s]+', text)]
See this regex demo
Details
\w+(?:-+\w+)+ - one or more word chars followed with one or more occurrences of one or more - chars and one or more word chars
| - or
\w+(?:\.+\w+)*\.? - one or more word chars followed with zero or more occurrences of one or more . chars and one or more word chars, and then an optional dot
| - or
[^\w\s]+ - one or more non-word and non-whitespace chars.
The re.sub(r'([.-])+', r'\1', x) part is a post-processing step to replace one or more consecutive . or - chars with a single occurrence.
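The two steps can be wrapped in a small function to try them on a few inputs. This is a sketch: the name split_tokens and the sample strings other than the ones from the question are assumptions made for illustration.

```python
import re

TOKEN_RE = re.compile(r'\w+(?:-+\w+)+|\w+(?:\.+\w+)*\.?|[^\w\s]+')

def split_tokens(text):
    # Step 1: find the tokens. Step 2: collapse each run of "." or "-"
    # inside a token into a single char.
    return [re.sub(r'([.-])+', r'\1', tok) for tok in TOKEN_RE.findall(text)]

print(split_tokens('really????'))  # ['really', '????']
print(split_tokens('ice--cream'))  # ['ice-cream']
print(split_tokens('b..sc'))       # ['b.sc']
print(split_tokens('b.sc....'))    # ['b.sc.', '.']
```

Note that the post-processing step runs on every token, so a punctuation-only token such as '...' is also collapsed to '.'. If such runs should be kept as they are, one option would be to apply re.sub only to tokens that start with a word char.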

Regex to match words followed by whitespace or punctuation

If I have the word india
MATCHES
"india!" "india!" "india." "india"
NON MATCHES "indian" "indiana"
Basically, I want to match the string but not when its contained within another string.
After doing some research, I started with
exp = r"(?<!\S)india(?!\S)"
num_matches = len(re.findall(exp, my_string))
but that doesn't match the punctuation and I'm not sure where to add that in.

Assuming the objective is to match a given word (e.g., "india") in a string provided the word is neither preceded nor followed by a character that is not in the string " .,?!;" you could use the following regex:
(?<![^ .,?!;])india(?![^ .,?!;\r\n])
Demo
Python's regex engine performs the following operations:
(?<! # begin a negative lookbehind
[^ .,?!;] # match 1 char other than those in " .,?!;"
) # end the negative lookbehind
india # match string
(?! # begin a negative lookahead
[^ .,?!;\r\n] # match 1 char other than those in " .,?!;\r\n"
) # end the negative lookahead
Notice that the character class in the negative lookahead contains \r and \n in case india is at the end of a line.
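The lookarounds above can be tried against the examples from the question. This is a minimal sketch: the combined sentence passed to findall is made up for illustration.

```python
import re

pattern = re.compile(r'(?<![^ .,?!;])india(?![^ .,?!;\r\n])')

samples = ['india!', 'india.', 'india', 'indian', 'indiana']
results = {s: bool(pattern.search(s)) for s in samples}
print(results)
# {'india!': True, 'india.': True, 'india': True, 'indian': False, 'indiana': False}

# Counting matches in a longer string, as in the question.
num_matches = len(pattern.findall('india! indian india. indiana india'))
print(num_matches)  # 3
```

The punctuation itself is not part of the match here; the lookarounds only check what surrounds the word.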

\"india(\W*?)\"
this will catch anything except for numbers and letters

Try this
^india[^a-zA-Z0-9]$
^india - the string must start with india
[^a-zA-Z0-9] - exactly one char other than a-z, A-Z, 0-9
$ - end of the string

Try with:
r'\bindia\W*\b'
See demo
To ignore case:
re.search(r'\bindia\W*\b', my_string, re.IGNORECASE).group(0)

you may use:
import re
s = "india."
s1 = "indiana"
print(re.search(r'\bindia[.!?]*\b', s))
print(re.search(r'\bindia[.!?]*\b', s1))
output:
<re.Match object; span=(0, 5), match='india'>
None

If you also want to match the punctuation, you could make use of a negated character class that matches any char except a word character or a newline.
(?<!\S)india[^\w\r\n]*(?!\S)
(?<!\S) Assert a whitespace boundary to the left
india Match literally
[^\w\r\n]* Match 0+ times any char except a word char or a newline
(?!\S) Assert a whitespace boundary to the right
Regex demo
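A short run of this pattern shows that the punctuation becomes part of the match. The sample sentence below is made up for this sketch.

```python
import re

text = 'india! indian india. indiana india'

matches = re.findall(r'(?<!\S)india[^\w\r\n]*(?!\S)', text)
print(matches)  # ['india!', 'india.', 'india']
```

Here "indian" and "indiana" are rejected because the character right after "india" is a word character, which neither the character class nor the right-hand whitespace boundary allows.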
