regex: matching 3 consecutive words - python

I'm trying to see if a string contains 3 consecutive words (divided by spaces and without numbers), but the regex I have constructed does not seem to work:
print re.match('([a-zA-Z]+\b){3}', "123 test bla foo")
None
This should return true since the string contains the 3 words "test bla foo".
What is the best way to achieve this?

Do:
(?:[A-Za-z]+ ){2}[A-Za-z]+
(?:[A-Za-z]+ ){2}: the non-capturing group (?:[A-Za-z]+ ) matches one or more alphabetic characters followed by a space; {2} matches two such groups in succession
[A-Za-z]+ matches one or more alphabetic characters after the preceding two words, making the third word
If you want the words to be separated by any whitespace instead of just space:
(?:[A-Za-z]+\s){2}[A-Za-z]+
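A minimal sketch of the pattern in use. Note that re.match only tries the start of the string, so for "123 test bla foo" you need re.search instead. Also, in a non-raw string '\b' is a backspace character rather than a word boundary, which is another reason the original attempt fails. The helper name has_three_words is made up for illustration.

```python
import re

# Three runs of ASCII letters, separated by single whitespace characters.
THREE_WORDS = re.compile(r'(?:[A-Za-z]+\s){2}[A-Za-z]+')

def has_three_words(text):
    """Return True if text contains three consecutive alphabetic words."""
    # search() scans the whole string; match() would only try position 0.
    return THREE_WORDS.search(text) is not None

print(has_three_words("123 test bla foo"))             # True
print(has_three_words("123 test 456 foo"))             # False
print(THREE_WORDS.search("123 test bla foo").group())  # test bla foo
```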

I use this to select the first words of a string:
^(?:[^\ ]+\ ){3}
I use spaces to define and delimit each word.
[^\ ]+\ : at least one character other than a space, followed by a space (written \ ).
After that you just have to enter the number of words you want: {3}
It works very well.
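A short sketch of how this might look in Python, with a made-up sample string. Because every word must be followed by a space, the match includes a trailing space (stripped here), and a string with exactly three words and no trailing space would not match.

```python
import re

# Hypothetical sample text, for illustration only.
text = "the quick brown fox jumps"

# Three runs of non-space characters, each followed by a space,
# anchored at the start of the string.
m = re.match(r'^(?:[^ ]+ ){3}', text)

# The match ends with a space, so strip it off.
first_three = m.group().rstrip() if m else None
print(first_three)  # the quick brown

# Exactly three words, no trailing space: the pattern finds nothing.
no_trailing = re.match(r'^(?:[^ ]+ ){3}', "only three words")
print(no_trailing)  # None
```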

This is a better option if words may contain hyphens or apostrophes, like "don't" or "mother-in-law":
([^\s]+ ){2}[^\s]+
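To see the difference, here is a small comparison on a made-up sentence. Keep in mind that [^\s]+ (the same as \S+) accepts any non-whitespace characters, so it also lets digits and punctuation through, which the original question wanted to exclude.

```python
import re

# Hypothetical sample sentence, for illustration only.
text = "my mother-in-law doesn't agree"

# Letters-only version: hyphens and apostrophes break the words apart.
letters_only = re.search(r'(?:[A-Za-z]+ ){2}[A-Za-z]+', text)

# Non-whitespace version: anything between spaces counts as a word.
non_space = re.search(r'(?:[^\s]+ ){2}[^\s]+', text)

print(letters_only)       # None
print(non_space.group())  # my mother-in-law doesn't
```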

Related

Python Regex: Apostrophes only when placed within letters, not as quotation marks

define each word to be the longest contiguous sequence of alphabetic characters (or just letters), including up to one apostrophe if that apostrophe is sandwiched between two letters.
[a-z]+[a-z/'?a-z]*[a-z$]
It doesn't match the letter 'a'.
Something like this should work
[a-zA-Z]*(?:[a-zA-Z]\'[a-zA-Z]|[a-zA-Z])[a-zA-Z]*
Match 0 or more letters [a-zA-Z]*, followed by either an apostrophe surrounded by 2 letters or a single letter (?:[a-zA-Z]\'[a-zA-Z]|[a-zA-Z]), then match 0 or more letters [a-zA-Z]*
For just lowercase letters
[a-z]*(?:[a-z]\'[a-z]|[a-z])[a-z]*
I'd use:
^(?:[a-z]+|[a-z]+'[a-z]+)$
with re.IGNORECASE
You seem to misunderstand the character class notation. The stuff between [ and ] is a list of characters to match. It does not make sense to list the same character multiple times, and basically all characters except ] and - (and initial ^ for negation) simply match themselves, i.e. lose their regex special meaning.
Let's rephrase your requirement. You want an alphabetic [a-z] repeated one or more times +, optionally followed by an apostrophe and another sequence of alphabetics.
[a-z]+('[a-z]+)?
In some regex dialects, you might prefer the non-capturing opening parenthesis (?: instead of plain (.
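A quick sketch of that last pattern in Python, using the non-capturing group so that re.findall returns whole words. The sample sentence is made up; apostrophes used as quotation marks are skipped because every match has to start with a letter and an apostrophe is only taken when letters follow it.

```python
import re

# Letters, optionally followed by one apostrophe plus more letters.
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?", re.IGNORECASE)

# Hypothetical sample mixing quotation marks and in-word apostrophes.
text = "'Don't stop,' she said, 'a dog's life.'"

words = WORD.findall(text)
print(words)
# ["Don't", 'stop', 'she', 'said', 'a', "dog's", 'life']
```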

Regex to find and list any three characters enclosed dashes and last match in the string

My regex finds three letters enclosed in dashes, but it only returns the first such group in the string:
(?:-)([A-Z]{3})+?(?:-)
I am trying to figure out a regex that finds every group of exactly three letters delimited by dashes, thus ignoring the leading ABC:
ABC-FOUR-ONE-FIVE-TWO
Can there be a regex that lists only ONE and TWO (that is, matches all except the first one)?
You may use
re.findall(r'-([A-Z]{3})(?![^-])', text)
Or, its equivalent
re.findall(r'-([A-Z]{3})(?=-|$)', text)
Pattern details
- - a hyphen
([A-Z]{3}) - Capturing group 1: three uppercase letters
(?=-|$) / (?![^-]) - match (but do not consume) a - or end of string position.
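Both variants can be checked quickly against the sample string. The last lines are an added illustration (not from the question) of why the trailing dash is left unconsumed: if the pattern swallowed it, the next group would have no leading dash left to match.

```python
import re

text = "ABC-FOUR-ONE-FIVE-TWO"

# Lookahead version: the dash (or end of string) after the letters
# is checked but not consumed.
with_lookahead = re.findall(r'-([A-Z]{3})(?=-|$)', text)
with_negative = re.findall(r'-([A-Z]{3})(?![^-])', text)
print(with_lookahead)  # ['ONE', 'TWO']
print(with_negative)   # ['ONE', 'TWO']

# Hypothetical string with two adjacent three-letter groups.
adjacent = "X-ONE-TWO-Y"
consuming = re.findall(r'-([A-Z]{3})-', adjacent)
non_consuming = re.findall(r'-([A-Z]{3})(?=-|$)', adjacent)
print(consuming)       # ['ONE']
print(non_consuming)   # ['ONE', 'TWO']
```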
Try something like this (-[A-Za-z]{3}(-|$)) (tested it at https://regex101.com/)
This regex says: Match a dash, then 3 [A-Za-z] characters and then finally the "-" character or "end of string"

How can I use regex to search unicode texts and find words that contain repeated alphabets?

I have a dataset which contains comments of people in Persian and Arabic. Some comments contain words like عاااالی, which is not a real word; the right word is actually عالی. It's like using woooooooow! instead of WoW!.
My intention is to find these words and remove all the extra letters. The only reference I found is the code below, which removes the words with repeated letters:
import re
p = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
s = "df\nAll aaaaaab the best 8965\nUS issssss is 123 good \nqqqq qwerty 1 poiks\nlkjh ggggqwe 1234 aqwe iphone5224s"
strs = s.split("\n")
print([p.sub("", x).strip() for x in strs])
I just need to replace each such word with a version that has the extra repeated letters removed. You can use this sentence as a test case:
سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم.
It has to be like this:
سلام چطورین؟ من خیلی گشتم ولی مثل این کیفیت اصلا ندیدم
Please note that more than 3 repeats are not acceptable.
You may use
re.sub(r'([^\W\d_])\1{2,}', r'\1', s)
It will replace each run of three or more identical consecutive letters with a single occurrence.
Details
([^\W\d_]) - Capturing group 1: any Unicode letter
\1{2,} - two or more repetitions of the same letter that is captured in Group 1.
The r'\1' replacement will only keep a single letter occurrence in the result.
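Here is a small sketch wrapping the substitution in a function (squeeze_repeats is a name invented for this example). In Python 3, str patterns are Unicode-aware by default, so [^\W\d_] covers Persian and Arabic letters as well as Latin ones. One caveat: a run is always collapsed to a single letter, so a stretched word whose correct spelling has a double letter loses it too.

```python
import re

def squeeze_repeats(text):
    """Collapse runs of 3+ identical letters to one letter."""
    # ([^\W\d_]) captures one letter; \1{2,} needs at least two more copies.
    return re.sub(r'([^\W\d_])\1{2,}', r'\1', text)

print(squeeze_repeats("woooooooow"))  # wow
print(squeeze_repeats("good"))        # good (a double letter is untouched)
print(squeeze_repeats("coool"))       # col  (not "cool": see the caveat)

s = "سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم."
print(squeeze_repeats(s))
```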

Replace unwanted special characters from a string, retain special characters between two numerical

Hi, I am working on an NLP project where I need to identify entities / organization names from text. However, the words in the strings are concatenated with the characters _ : , as shown below:
RING_LECO:108_.250X.436X.093V_772_520
I would want to clean the string as below:
Ring Leco 108 .250X.436X.093V 772_520
We have removed special characters between two words (A-Z:A-Z,A-Z:0-9) but retained _ symbol between 772 and 520.
Is there any way that I could do this?
Try using
(?<=\D)[_:,]|[_:,](?=\D)
\D represents a non-digit character, so the pattern matches special characters (_:,) that have a non-digit character on at least one side.
import re
text = 'RING_LECO:108_.250X.436X.093V_772_520'  # named text so the built-in str is not shadowed
pattern = re.compile(r'(?<=\D)[_:,]|[_:,](?=\D)')
print(pattern.sub(' ', text))
Output:
RING LECO 108 .250X.436X.093V 772_520
This regex should do the trick:
(?<=[^0-9])_|_(?=[^0-9])
In English: "an underscore with a non-digit either before or after it"
(?<=...) is a lookbehind: it requires that whatever precedes matches, without consuming or capturing it
(?=...) is a lookahead: it requires that whatever follows matches, without consuming or capturing it

Python, Regular Expression: How to remove letter.letter(a.b) from string?

How can I remove combination of letter-dot-letter (example F.B) from string in python ? I tried using regex:
abre = re.sub(r"\b\w+\.\w+#",'',abre)
but it does not remove these sequences; it just prints the same unchanged string. I also tried removing all dots and then removing words shorter than 2 letters, but in that case I lose real words.
What I have: C.P.A. Certification Program, Accounting
What I want to get: Certification Program, Accounting
The length of the sequence is not always known and the letters are also unknown.
You seem to want to remove words that consist of dot-separated uppercase letters.
Use
abre = re.sub(r"\b(?:[A-Z]\.)+(?!\w)",'',abre)
To also remove trailing whitespace, you may add \s* at the end. If there must be at least two letters, replace + with {2,}.
Details:
\b - leading word boundary
(?:[A-Z]\.)+ - one or more sequences of
[A-Z] - an uppercase ASCII letter
\. - a dot
(?!\w) - not followed with a word char
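A short sketch with both suggested tweaks applied: \s* at the end to swallow the whitespace after the abbreviation, and {2,} to demand at least two letters. The second sample string is made up for illustration.

```python
import re

abre = "C.P.A. Certification Program, Accounting"

# One or more "uppercase letter + dot" pairs, not followed by a word
# character, plus any whitespace that follows.
cleaned = re.sub(r"\b(?:[A-Z]\.)+(?!\w)\s*", '', abre)
print(cleaned)  # Certification Program, Accounting

# With {2,} a lone "B." survives, but "U.S.A." is still removed.
sample = "Plan B. U.S.A. office"  # hypothetical input
cleaned_strict = re.sub(r"\b(?:[A-Z]\.){2,}(?!\w)\s*", '', sample)
print(cleaned_strict)  # Plan B. office
```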
If the sequence to remove is a fixed, known string, you can use replace:
>>> string="rgoa.bwtg.rgqra.bergeg"
>>> string.replace("a.b", "")
'rgowtg.rgqrergeg'
