Regex to match capitalized word, and the surrounding +- 4 words - python

I have a bunch of documents and I'm interested in finding mentions of clinical trials. These are always denoted by the letters being in all caps (e.g. ASPIRE). I want to match any word in all caps, greater than three letters. I also want the surrounding +- 4 words for context.
Below is what I currently have. It kind of works, but fails the test below.
import re
pattern = '((?:\w*\s*){,4})\s*([A-Z]{4,})\s*((?:\s*\w*){,4})'
line = r"Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."
re.findall(pattern, line)

You can use this code in Python, which does it in two steps. First we split the input on capital words of 4+ letters, and then we find up to 4 words on either side of each match.
import re
text = 'Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY'
re1 = r'\b([A-Z]{4,})\b'
re2 = r'(?:\s*\w+\b){,4}'
# re.split keeps the captured words, so they sit at the odd indices of arr
arr = re.split(re1, text)
result = []
for i in range(len(arr)):
    if i % 2:
        result.append((re.search(re2, arr[i-1]).group(), arr[i], re.search(re2, arr[i+1]).group()))
print(result)
Output:
[('Lorem', 'IPSUM', ' is simply'), (' is simply', 'DUMMY', ' text of the printing'), (' text of the printing', 'INDUSTRY', '')]
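Note that re.search(re2, arr[i-1]) takes the first four words of the preceding chunk, which is why the left context of INDUSTRY above is ' text of the printing' rather than the words right next to it. Below is a minimal sketch of an alternative, assuming it is acceptable to tokenize on \w+ and to let other capitalized words appear in the context. The name caps_with_context and its parameters are hypothetical, not part of the answer above.

```python
import re

def caps_with_context(text, width=4, min_len=4):
    """Return (words_before, caps_word, words_after) for each all-caps word."""
    words = re.findall(r"\w+", text)  # tokenize; punctuation is dropped
    caps = re.compile(r"[A-Z]{%d,}" % min_len)
    hits = []
    for i, word in enumerate(words):
        if caps.fullmatch(word):
            before = words[max(0, i - width):i]   # last `width` words before
            after = words[i + 1:i + 1 + width]    # first `width` words after
            hits.append((before, word, after))
    return hits

line = "Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."
for hit in caps_with_context(line):
    print(hit)
```

Working on a word list rather than on one big regex avoids the problem of matches consuming each other's context, at the cost of losing the original spacing and punctuation.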

Would the following regex work for you?
(\b\w+\b\W*){,4}[A-Z]{3,}\W*(\b\w+\b\W*){,4}
Tested here: https://regex101.com/r/nTzLue/1/

On the left side, you could match one or more word characters (\w+) followed by one or more non-word characters (\W+). Combine those two in a non-capturing group and repeat it 4 times with {4}: (?:\w+\W+){4}
Then capture 3 or more uppercase characters in a group ([A-Z]{3,}).
On the right side, you could then reverse the order of the word and non-word characters relative to the left side: (?:\W+\w+){4}
(?:\w+\W+){4}([A-Z]{3,})(?:\W+\w+){4}
The capturing group will contain your uppercase word, and the non-capturing groups will match the surrounding words.
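A quick check of this pattern in Python against the sentence from the question. Because {4} demands exactly four words on each side, capitalized words closer than four words to either end of the text are not matched. This is only a sketch to show that behavior:

```python
import re

pattern = re.compile(r'(?:\w+\W+){4}([A-Z]{3,})(?:\W+\w+){4}')
line = "Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."

# findall returns only the capturing group, i.e. the uppercase word
matches = pattern.findall(line)
print(matches)  # ['DUMMY']
```

IPSUM has only one word before it and INDUSTRY has none after it, so neither satisfies the fixed {4} repetition.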

This should do the job:
pattern = '(?:(\w+ ){4})[A-Z]{3}(\w+ ){5}'

Related

Python regex take characters around word

I would like to include 5 characters before and after a specific word is matched in my regex query. Those words are in a list and I iterate over it.
See example below, this is what I tried:
import re
text = "This is an example of quality and this is true."
words = ['example', 'quality']
words_around = []
for word in words:
    neighbors = re.findall(fr'(.{0,5}{word}.{0,5})', str(text))
    words_around.append(neighbors)
print(words_around)
The output is empty. I would expect an array containing ['s an example of q', 'e of quality and ']
You can use the PyPI regex module here, which allows lookbehind patterns of variable (even unbounded) length:
import regex
import pandas as pd
words = ['example', 'quality']
df = pd.DataFrame({'col': [
    "This is an example of quality and this is true.",
    "No matches."
]})
rx = regex.compile(fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))')
def extract_regex(s):
    return ["".join(x) for x in rx.findall(s)]
df['col2'] = df['col'].apply(extract_regex)
Output:
>>> df
                                               col                                    col2
0  This is an example of quality and this is true.  [s an example of q, e of quality and ]
1                                      No matches.                                      []
Both the pattern and how it is used matter.
The fr'(?<=(.{{0,5}}))({"|".join(words)})(?=(.{{0,5}}))' part defines the regex pattern. This is a raw f-string literal: the f prefix makes it possible to use variables inside the string literal, but it also requires doubling all literal braces inside it. The pattern, given the current words list, looks like (?<=(.{0,5}))(example|quality)(?=(.{0,5})). It captures 0-5 chars before the words inside a positive lookbehind, then captures the words, and then captures the next 0-5 chars in a positive lookahead (lookarounds are used to make sure any overlapping matches are found).
The ["".join(x) for x in rx.findall(s)] part joins the groups of each match into a single string, and returns a list of matches as a result.
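To see why the attempt in the question returned nothing: in an f-string, single braces are evaluated as expressions, so the {0,5} quantifier never reaches the regex. The sketch below shows this, followed by an alternative that needs only the standard re module and slices the text around each match. The helper name context_windows is hypothetical, and this is a different technique from the lookaround pattern above.

```python
import re

word = "example"
# Single braces are evaluated by the f-string, so {0,5} becomes the tuple (0, 5)
print(fr'.{0,5}{word}')      # .(0, 5)example
# Doubled braces survive as literal braces
print(fr'.{{0,5}}{word}')    # .{0,5}example

def context_windows(text, words, width=5):
    """Return each occurrence of any word with up to `width` chars on each side."""
    pattern = re.compile("|".join(re.escape(w) for w in words))
    return [text[max(0, m.start() - width):m.end() + width]
            for m in pattern.finditer(text)]

text = "This is an example of quality and this is true."
print(context_windows(text, ["example", "quality"]))
# ['s an example of q', 'e of quality and ']
```

Slicing by match position lets neighbouring windows overlap freely, which is the same effect the lookarounds achieve in the regex-module version.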

How can I use regex to search Unicode texts and find words that contain repeated letters?

I have a dataset which contains people's comments in Persian and Arabic. Some comments contain words like عاااالی, which is not a real word; the right word is actually عالی. It's like using woooooooow! instead of WoW!.
My intention is to find these words and remove all the extra letters. The only reference I found is the code below, which removes the words with repeated letters:
import re
p = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
s = "df\nAll aaaaaab the best 8965\nUS issssss is 123 good \nqqqq qwerty 1 poiks\nlkjh ggggqwe 1234 aqwe iphone5224s"
strs = s.split("\n")
print([p.sub("", x).strip() for x in strs])
I just need to replace each such word with the version that has the extra repeated letters removed. You can use this sentence as a test case:
سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم.
It has to be like this:
سلام چطورین؟ من خیلی گشتم ولی مثل این کیفیت اصلا ندیدم
Please note that more than 3 repeats are not acceptable.
You may use
re.sub(r'([^\W\d_])\1{2,}', r'\1', s)
It will replace chunks of identical consecutive letters with their single occurrence.
Details
([^\W\d_]) - Capturing group 1: any Unicode letter
\1{2,} - two or more repetitions of the same letter that is captured in Group 1.
The r'\1' replacement will only keep a single letter occurrence in the result.
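Put together as a small runnable sketch (squeeze_repeats is a hypothetical helper name). In Python 3, str patterns are Unicode-aware by default, so [^\W\d_] also covers Persian and Arabic letters. Note that this collapses every run of three or more identical letters to a single letter, while doubled letters are left alone:

```python
import re

REPEATS = re.compile(r'([^\W\d_])\1{2,}')

def squeeze_repeats(text):
    """Collapse runs of 3+ identical letters into a single letter."""
    return REPEATS.sub(r'\1', text)

print(squeeze_repeats("woooooooow!"))  # wow!
print(squeeze_repeats("good book"))    # good book
print(squeeze_repeats("سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم."))
```

A word that legitimately contains the same letter three times in a row would also be shortened, so a dictionary check may be worth adding if that matters for your data.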

regex: matching 3 consecutive words

I'm trying to see if a string contains 3 consecutive words (divided by spaces and without numbers), but the regex I have constructed does not seem to work:
print(re.match('([a-zA-Z]+\b){3}', "123 test bla foo"))
None
This should return true since the string contains the 3 words "test bla foo".
What is the best way to achieve this?
Do:
(?:[A-Za-z]+ ){2}[A-Za-z]+
(?:[A-Za-z]+ ){2}: the non-captured group (?:[A-Za-z]+ ) matches one or more alphabetic characters followed by space, {2} matches two such successive groups
[A-Za-z]+ matches one or more alphabetic characters after the preceding two words, making the third word
If you want the words to be separated by any whitespace instead of just space:
(?:[A-Za-z]+\s){2}[A-Za-z]+
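One more thing to watch in Python: re.match only matches at the start of the string, so with the input from the question (which starts with 123) it still returns None even with a corrected pattern. A minimal sketch using search instead:

```python
import re

pattern = re.compile(r'(?:[A-Za-z]+\s){2}[A-Za-z]+')
s = "123 test bla foo"

print(pattern.match(s))                  # None, because the string starts with "123"
print(pattern.search(s).group())         # test bla foo
print(bool(pattern.search("a1 b2 c3")))  # False
```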
I use this to select the first words of a string:
^(?:[^\ ]+\ ){3}
I use spaces to define and delimit each word.
[^\ ]+: at least one character other than a space, followed by a space (\ ).
Then you just have to enter the number of words you want: {3}
It works very well.
This is a much better option. It includes words with hyphens or apostrophes, like "don't" or "mother-in-law":
([^\s]+ ){2}[^\s]+

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in it but not at the beginning and not at the end of the word
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
What this means is: find an alphanumeric character at least once, as indicated by the '+', then a '-', followed by another alphanumeric character at least once, again as indicated by the '+'.
str is a built-in name, so it is better not to use it as a variable name:
st = 'word semi-column peace'
# \w+ word - \w+ word after -
print(re.findall(r"\b\w+-\w+\b",st))
['semi-column']
a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to prevent a match from words with hyphens at the beginning or end. A string like "-not-wanted-" will match both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
print(result)  # ['semi-column', 'one-word', 'multi-hyphenated-word']
You can also use the following regex:
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+\-\S+", st))
['semi-column']
You can try something like this. Centering on the hyphen, I match until there is a white space in either direction. I also check whether the words are surrounded by hyphens (e.g. -test-cats-), and if they are, I make sure not to include the surrounding hyphens. The regular expression should also work with findall.
import re
st = 'word semi-column peace'
m = re.search(r'([^ | ^-]+-[^ | ^-]+)', st)
if m:
    print(m.group(1))

Python regex keep a few more tokens

I am using the following regex in Python to keep words that do not contain non alphabetical characters:
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
The problem is that this regex does not keep words that I would like to keep such as the following:
Company,
months.
third-party
In other words I would like to keep words that are followed by a comma, a dot, or have a dash between two words.
Any ideas on how to implement this?
I tried adding something like |(?<!\S)[A-Za-z]+(?=\.(?!\S)) for the dots but it does not seem to be working.
Thanks !
EDIT:
Should match these:
On-line
. These
maintenance,
other.
. Our
Google
Should NOT match these:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
NY7xtb92dCTfvEjdmkDrUw==
$As_Of_12_31_20104206http://www.sec.gov/CIK0001393311instant2010-12-31T00:00:000001-01-01T00:00:00falsefalseArlington/S.Cooper
-Publisher
gaap_RealEstateAndAccumulatedDepreciationCostsCapitalizedSubsequentToAcquisitionCarryingCostsus
At the moment I am using the following python code to read a text file line by line:
find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))').findall
Then I open the text file:
contents = open("test.txt","r")
and I search for the words line by line:
for line in contents:
    if find_words(line.lower()) != []:
        lineWords = find_words(line.lower())
        print("The words in this line are: ", lineWords)
using some word lists in the following way:
wanted1 = set(find_words(open('word_list_1.csv').read().lower()))
wanted2 = set(find_words(open('word_list_2.csv').read().lower()))
negators = set(find_words(open('word_list_3.csv').read().lower()))
I first want to get the valid words from the .txt file, and then check whether these words belong to the word lists. The two steps are independent.
I propose this regex:
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall
There are 3 parts from your initial regex that I changed:
Middle part:
[A-Za-z]+(?:-[A-Za-z]+)*
This allows hyphenated words.
The last part:
(?=\W|$)
This is a bit similar to (?!\S) except that it also allows characters that are not spaces, such as punctuation. So this will allow a match if, after the matched word, the line ends or there is a non-word character; in other words, the next character is not a letter, digit or underscore (if you want word_ to yield the match word, then you will have to change \W to [^a-zA-Z0-9]).
The first part (probably most complex):
(?:(?<=[^\w./-])|(?<=^))
It is itself composed of two alternatives, matching either (?<=[^\w./-]) or (?<=^). The second one allows a match if the word to be matched is at the very beginning of the line. We cannot use (?<=[^\w./-]|^) because a lookbehind in Python's re module cannot be of variable width (with [^\w./-] having a length of 1 and ^ a length of 0).
(?<=[^\w./-]) allows a match if, before the word, there are no word characters, periods, forward slashes or hyphens.
When broken down, the small parts are rather straightforward I think, but if there's anything you want some more elaboration, I can give more details.
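As a rough check, here is the proposed pattern run against a few of the tokens listed in the question. The two lists are just a subset of the question's examples, each tested as its own string:

```python
import re

find_words = re.compile(
    r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)'
).findall

should_match = ["On-line", "maintenance,", "other.", "Google"]
should_not_match = ["-Publisher", "NY7xtb92dCTfvEjdmkDrUw=="]

for token in should_match + should_not_match:
    # An empty list means the token was rejected
    print(repr(token), "->", find_words(token))
```

The trailing comma or period is not part of the returned word, because the look-ahead only inspects the next character without consuming it.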
This is not purely a regex task, because you cannot tell whether something is a word with a regex alone. You need a dictionary to check your words against.
So I suggest using a regex to split your string on non-alphabetical characters and checking whether all of the items exist in your dictionary. For example:
import re
words = re.split(r'[^A-Za-z]+', my_string)
print(all(i in my_dict for i in words if i))
As an alternative, you can use nltk.corpus as your dictionary:
from nltk.corpus import wordnet
words = re.split(r'[^A-Za-z]+', my_string)
if all(wordnet.synsets(i) for i in words if i):
    pass  # do stuff
But if you want to use your own word list, you need to change your regex, because it is incorrect. Instead, use re.split as above:
all_words = wanted1 | wanted2 | negators
with open("test.txt", "r") as f:
    for line in f:
        for word in line.split():
            words = re.split(r'[^A-Za-z]+', word)
            if all(i in all_words for i in words if i):
                print(word)
Instead of using all sorts of complicated look-arounds, you can use \b to detect the boundary of words. This way, you can use e.g. \b[a-zA-Z]+(?:-[a-zA-Z]+)*\b
Example:
>>> p = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b"
>>> text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
>>> re.findall(p, text)
['This', 'is', 'some', 'example', 'text', 'with', 'some', 'multi-hyphen-words', 'and', 'words', 'in', 'it']
Update: Seems like this does not work too well, as it also detects fragments from URLs, e.g. www, sec and gov from http://www.sec.gov.
Instead, you might try this variant, using look-around explicitly stating the 'legal' characters:
r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
This seems to pass all your test-cases.
Let's dissect this regex:
(?<![^\s("]) - look-behind asserting that the word is preceded by a space, quote or parenthesis, but e.g. not a number (using double negation instead of a positive look-behind so the first word is matched, too)
[a-zA-Z]+ - the first part of the word
(?:[-'][a-zA-Z]+)* - optionally more word-segments after a ' or -
(?=[\s.,:;!?")]) - look-ahead asserting that the word is followed by space, punctuation, quote or parens
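A short comparison of the two patterns from this answer on a made-up line. The sample text is an assumption, chosen to include a URL fragment glued to digits and a word with a leading hyphen, similar to the question's "should NOT match" cases:

```python
import re

simple = re.compile(r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b")
strict = re.compile(r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])""")

line = "On-line maintenance, -Publisher 42http://www.sec.gov/x1 other."

print(simple.findall(line))
# ['On-line', 'maintenance', 'Publisher', 'www', 'sec', 'gov', 'other']
print(strict.findall(line))
# ['On-line', 'maintenance', 'other']
```

Two caveats follow from the look-arounds as written: a URL that directly follows whitespace would still yield http, since : is in the look-ahead set, and a word at the very end of the input with nothing after it is not matched, because the look-ahead requires a following character.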
