I have to write code that searches for regular expressions in an Excel sheet whose cells contain several sentences grouped together. I have managed to find the keywords representing each sentence. When I run the code below, it finds only one keyword per cell and then moves on to the next cell. I have tried to show the requirement in the table.
\bphrase\W+(?:\w+\W+){0,6}?one\b|\bphrase\W+(?:\w+\W+){0,6}?two\b|\bphrase\W+(?:\w+\W+){0,6}?three\b|\bphrase\W+(?:\w+\W+){0,6}?four\b|
The regex:
\b(phrase)\b\W+(?:\w+\W+){0,6}?\b(one|two|three|four)\b
\b(phrase)\b matches phrase on a word boundary.
\W+ matches one or more non-word characters (typically spaces).
(?:\w+\W+){0,6}? Matches between 0 and 6 times, as few times as possible, one or more word characters followed by one or more non-word characters.
\b(one|two|three|four)\b Matches one, two, three or four on a word boundary.
The code:
import re
text = "This sentence has phrase one and phrase word word two and phrase word three and phrase four phrase too many words too many words too many words four again."
l = [m[1] + ' ' + m[2] for m in re.finditer(r'\b(phrase)\b\W+(?:\w+\W+){0,6}?\b(one|two|three|four)\b', text)]
print(l)
Prints:
['phrase one', 'phrase two', 'phrase three', 'phrase four']
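To apply this to every cell rather than to a single string, one option is to loop over the cell texts and collect all matches per cell with re.finditer. The following is only a minimal sketch: it assumes the cells have already been read into a plain list of strings (the list `cells` and the helper `find_pairs` are hypothetical names, and reading the Excel file itself is left out).

```python
import re

# Same pattern as above, compiled once so it can be reused for every cell.
PATTERN = re.compile(r'\b(phrase)\b\W+(?:\w+\W+){0,6}?\b(one|two|three|four)\b')

def find_pairs(cell_text):
    """Return every 'phrase <keyword>' pair found in one cell's text."""
    return [m[1] + ' ' + m[2] for m in PATTERN.finditer(cell_text)]

# Hypothetical cell contents standing in for the spreadsheet column.
cells = [
    "phrase one and phrase word two",
    "nothing to find here",
    "phrase a b three",
]

results = [find_pairs(cell) for cell in cells]
for cell, pairs in zip(cells, results):
    print(repr(cell), '->', pairs)
```

Because finditer keeps scanning after each match, a cell with several keywords yields several entries instead of only the first one.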
I'm trying to find a regexp that catches all instances that contain one and only one \n and any number of (space), in a string that might also contain instances with multiple \n. So, for instance (with spaces denoted with _):
Should be caught:
\n
_\n
\n_
_\n_
Should *not* be caught, not even the first \n:
_
___
\n\n\n\n
\n\n\n_\n\n
_\n\n
\n\n_
_\n\n_
_\n\n_\n
\n_\n_
_\n_\n
_\n\n_\n_
___\n__\n and so on...
(Using re in Python 3 on Windows 10.)
Edit to clarify the context: I'm parsing the text of a web page and I have a block of text in a string, that looks like that:
Word word word. Word word word word word. \n Word word word word word word. Word word word word. \n\n \nWord word word word word. \nWord word word. Word word word.
In the subsequent steps of my code, I'm using a function that gets rid of any \n, so I want to detect where they are before using this function, so I can keep them (by replacing them temporarily with special characters that won't disappear). But as you can see, I have two cases :
1) Multiple \n indicate a break of paragraphs, but I have no way to be sure that they follow each other without spaces or tabs between them. I want to catch them to replace them with a special character (like § for instance) that will let me know later where to put back multiple \n. It only matters that I know there are 2 or more \n, not how many there are. At the moment, I'm using this (but please do tell me if there is a bug):
text = re.sub(r"[ \t]*(?:\n[ \t]*){2,}", "$", text)
2) Single \n indicate a line break within a paragraph. These are what I want to single out, without catching the instances of the previous case. Again, it's to replace them with a special character (say |) to put it back later:
text = re.sub(r" the_regex_I'm_looking_for ", "|", text)
(I know I could do the first replacement, and then search for the remaining \n, but for reasons that would be largely irrelevant here and long to explain, I can't.)
2nd edit: So, for instance, the desired result in this case would be:
Word word word. Word word word word word. | Word word word word word word. Word word word word. $ Word word word word word. | Word word word. Word word word.
(I'd rather have no spaces before and after the § and the |, but here I'm forced to put them for the bold formatting of StackOverflow, if I don't I get something like **$**that.)
Would the following pattern suit you?
import regex as re
StrVal = r'Word word word. Word word word word word. \n Word word word word word word. Word word word word. \n\n \nWord word word word word. \nWord word word. Word word word.'
StrVal = re.sub(r'(?<!\\n\s*)\s*\\n\s*(?!\s*\\n)', '|', StrVal)
print(StrVal)
Returns:
Word word word. Word word word word word.|Word word word word word word. Word word word word. \n\n \nWord word word word word.|Word word word. Word word word.
Instead of the re module, I used the regex module so that the negative lookbehind can contain a variable-width quantifier, which re does not allow. As a result, runs such as \n \n\n \n are left unsubstituted.
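If installing the third-party regex module is not an option, a possible alternative with the standard re module is to match every run of newlines (with surrounding spaces or tabs) in a single pass and let a replacement function pick the marker from the number of \n in the run. This is a sketch under assumptions, not the approach of the answer above: it works on real newline characters rather than the literal backslash-n of the raw string, and the names `RUN` and `mark_breaks` are made up for illustration.

```python
import re

# One run = optional spaces/tabs, then one or more newlines, each optionally
# followed by spaces/tabs.
RUN = re.compile(r"[ \t]*(?:\n[ \t]*)+")

def mark_breaks(text, single="|", multiple="$"):
    """Replace each newline run by a marker chosen from its newline count."""
    def choose(match):
        # Exactly one newline in the run: line break inside a paragraph.
        # Two or more: paragraph break.
        return single if match.group().count("\n") == 1 else multiple
    return RUN.sub(choose, text)

sample = "Word word. \n Word word. \n\n \nWord word. \nWord word."
print(mark_breaks(sample))
```

Both kinds of break are handled in one re.sub call, so no second search over the remaining \n is needed.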
Check this demo whether it is ok for you. I have used space instead of "_".
import re
pattern = '^ *\n *$'
test_string = "\n\n "
result = re.findall(pattern, test_string)
print(result)
NB: I first tried '^\s*\n\s*', but it does not work because \s is equivalent to [ \t\n\r\f\v] and therefore also matches \n, so I used a literal space character ' ' instead.
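One caveat with the pattern above: in Python, $ also matches just before a newline at the very end of the string, so '^ *\n *$' accepts "\n\n". A hedged variant is to use re.fullmatch, which must consume the whole string. The sketch below checks it against the cases listed in the question; the helper name `is_single_break` is hypothetical, and like the demo above it only classifies a whole string rather than locating runs inside a longer text.

```python
import re

def is_single_break(s):
    """True if s is exactly one newline with optional spaces around it."""
    return re.fullmatch(r" *\n *", s) is not None

should_match = ["\n", " \n", "\n ", " \n "]
should_not_match = [" ", "   ", "\n\n", " \n\n", "\n\n ", "\n \n", " \n \n"]

print([is_single_break(s) for s in should_match])
print([is_single_break(s) for s in should_not_match])

# The '$' pitfall: the anchored pattern still accepts a doubled newline.
print(re.findall(r"^ *\n *$", "\n\n"))
```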
I have a dataset which contains people's comments in Persian and Arabic. Some comments contain words like عاااالی, which is not a real word; the right word is actually عالی. It's like using woooooooow! instead of WoW!.
My intention is to find these words and remove all the extra letters. The only reference I found is the code below, which removes the words with repeated letters:
import re
p = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
s = "df\nAll aaaaaab the best 8965\nUS issssss is 123 good \nqqqq qwerty 1 poiks\nlkjh ggggqwe 1234 aqwe iphone5224s"
strs = s.split("\n")
print([p.sub("", x).strip() for x in strs])
I just need to replace each such word with a version that has the extra repeated letters removed. You can use this sentence as a test case:
سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم.
It has to be like this:
سلام چطورین؟ من خیلی گشتم ولی مثل این کیفیت اصلا ندیدم
please consider that more than 3 repeats are not acceptable.
You may use
re.sub(r'([^\W\d_])\1{2,}', r'\1', s)
It will replace chunks of identical consecutive letters with their single occurrence.
See the regex demo.
Details
([^\W\d_]) - Capturing group 1: any Unicode letter
\1{2,} - two or more repetitions of the same letter that is captured in Group 1.
The r'\1' replacement will only keep a single letter occurrence in the result.
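A small runnable sketch of this substitution follows. The function name `collapse_repeats` and the English sample strings are illustrative assumptions; the Persian word is the one from the question.

```python
import re

def collapse_repeats(s):
    """Collapse any letter repeated three or more times in a row to one."""
    return re.sub(r'([^\W\d_])\1{2,}', r'\1', s)

print(collapse_repeats("woooooooow, thaaaanks"))
print(collapse_repeats("عاااالی"))
print(collapse_repeats("good book"))  # doubled letters are left alone
```

Note that a legitimately doubled letter that has been stretched is also reduced to a single letter (for example "gooood" becomes "god"), so a dictionary lookup may still be needed afterwards.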
I have a bunch of documents and I'm interested in finding mentions of clinical trials. These are always denoted by the letters being in all caps (e.g. ASPIRE). I want to match any word in all caps, greater than three letters. I also want the surrounding +- 4 words for context.
Below is what I currently have. It kind of works, but fails the test below.
import re
pattern = '((?:\w*\s*){,4})\s*([A-Z]{4,})\s*((?:\s*\w*){,4})'
line = r"Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."
re.findall(pattern, line)
You may use this Python code, which does it in two steps: first we split the input on all-caps words of 4+ letters, and then we find up to 4 words on either side of each match.
import re
text = 'Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY'
re1 = r'\b([A-Z]{4,})\b'   # all-caps word of 4+ letters; captured so split() keeps it
re2 = r'(?:\s*\w+\b){,4}'  # up to 4 words
arr = re.split(re1, text)
result = []
for i in range(len(arr)):
    if i % 2:  # odd indices hold the captured all-caps words
        result.append((re.search(re2, arr[i-1]).group(), arr[i], re.search(re2, arr[i+1]).group()))
print(result)
Output:
[('Lorem', 'IPSUM', ' is simply'), (' is simply', 'DUMMY', ' text of the printing'), (' text of the printing', 'INDUSTRY', '')]
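In the output above, the left context of INDUSTRY is ' text of the printing', i.e. the first four words of the preceding segment rather than the four words immediately before the match. If the nearest words are wanted on both sides, one possible alternative is to tokenize first and slice around each hit. This is a sketch of a different technique, not the answer's method; `caps_with_context`, `width` and `min_len` are hypothetical names.

```python
import re

def caps_with_context(line, width=4, min_len=4):
    """Return (left_words, CAPS_WORD, right_words) for each all-caps word."""
    words = re.findall(r"\w+", line)
    caps = re.compile(r"[A-Z]{%d,}" % min_len)
    hits = []
    for i, word in enumerate(words):
        if caps.fullmatch(word):
            left = words[max(0, i - width):i]     # up to `width` words before
            right = words[i + 1:i + 1 + width]    # up to `width` words after
            hits.append((left, word, right))
    return hits

line = "Lorem IPSUM is simply DUMMY text of the printing and typesetting INDUSTRY."
for hit in caps_with_context(line):
    print(hit)
```

Context windows may overlap here (DUMMY appears in the right context of IPSUM), which a single regex pass cannot express because matches cannot overlap.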
Would the following regex work for you?
(\b\w+\b\W*){,4}[A-Z]{3,}\W*(\b\w+\b\W*){,4}
Tested here: https://regex101.com/r/nTzLue/1/
On the left side you could match one or more word characters \w+ followed by one or more non-word characters \W+. Combine those two in a non-capturing group and repeat it 4 times with {4}, like (?:\w+\W+){4}
Then capture 3 or more uppercase characters in a group ([A-Z]{3,}).
On the right side you could then reverse the order of the word and non-word characters matched on the left side: (?:\W+\w+){4}
(?:\w+\W+){4}([A-Z]{3,})(?:\W+\w+){4}
The capturing group will contain your uppercase word; the non-capturing groups match the surrounding words, which are part of the full match.
This should do the job:
pattern = '(?:(\w+ ){4})[A-Z]{3}(\w+ ){5}'
I am trying to make a regular expression to parse a string, and insert two new lines for each period that is sandwiched between two letters.
For example:
string_var = 'This is my first sentence.This is my second sentence.This is my third sentence. This is my fourth sentence.This is my fifth sentence.'
Each sentence except the fourth ends with no space between its last word, the period, and the first word of the next sentence. I would like to have the output:
string_var = 'This is my first sentence.
This is my second sentence.
This is my third sentence. This is my fourth sentence.
This is my fifth sentence.'
Does anyone know how I can accomplish this?
This adds two newlines to periods that are surrounded by two letters (technically any alphanumeric character or an underscore):
re.sub(r'(?<=\w)\.(?=\w)', ".\n\n", string_var)
This makes use of lookarounds: it takes a look at every period, and only matches if the character before it is a letter and the character after it is a letter. These matchers just look, and are not replaced by the replacement text.
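Applied to the string from the question, the substitution can be tried as below. The second pattern is an assumed variant, not part of the answer: it restricts both lookarounds to ASCII letters, so that a period between digits (as in 3.14) is left alone.

```python
import re

string_var = ('This is my first sentence.This is my second sentence.'
              'This is my third sentence. This is my fourth sentence.'
              'This is my fifth sentence.')

# Period with a word character on both sides: append a blank line after it.
result = re.sub(r'(?<=\w)\.(?=\w)', '.\n\n', string_var)
print(result)

# Assumed stricter variant: letters only, so decimals such as 3.14 survive.
letters_only = re.sub(r'(?<=[A-Za-z])\.(?=[A-Za-z])', '.\n\n',
                      'Version 3.14 is out.Next sentence.')
print(letters_only)
```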
I am using the following regex in Python to keep words that do not contain non alphabetical characters:
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
The problem is that this regex does not keep words that I would like to keep such as the following:
Company,
months.
third-party
In other words I would like to keep words that are followed by a comma, a dot, or have a dash between two words.
Any ideas on how to implement this?
I tried adding something like |(?<!\S)[A-Za-z]+(?=\.(?!\S)) for the dots but it does not seem to be working.
Thanks !
EDIT:
Should match these:
On-line
. These
maintenance,
other.
. Our
Google
Should NOT match these:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
NY7xtb92dCTfvEjdmkDrUw==
$As_Of_12_31_20104206http://www.sec.gov/CIK0001393311instant2010-12-31T00:00:000001-01-01T00:00:00falsefalseArlington/S.Cooper
-Publisher
gaap_RealEstateAndAccumulatedDepreciationCostsCapitalizedSubsequentToAcquisitionCarryingCostsus
At the moment I am using the following python code to read a text file line by line:
find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))').findall
Then I open the text file:
contents = open("test.txt","r")
and I search for the words line by line:
for line in contents:
    lineWords = find_words(line.lower())
    if lineWords:
        print("The words in this line are:", lineWords)
using some word lists in the following way:
wanted1 = set(find_words(open('word_list_1.csv').read().lower()))
wanted2 = set(find_words(open('word_list_2.csv').read().lower()))
negators = set(find_words(open('word_list_3.csv').read().lower()))
I first want to get the valid words from the .txt file, and then check whether these words belong to the word lists. The two steps are independent.
I propose this regex:
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall
There are 3 parts from your initial regex that I changed:
Middle part:
[A-Za-z]+(?:-[A-Za-z]+)*
This allows hyphenated words.
The last part:
(?=\W|$)
This is similar to (?!\S), except that it also allows non-space characters such as punctuation. It permits a match if, after the matched word, the line ends or there is a non-word character, that is, anything other than a letter, digit or underscore (so word_ does not match as word; if you do want it to, change \W to [^a-zA-Z0-9]).
The first part (probably most complex):
(?:(?<=[^\w./-])|(?<=^))
It is itself composed of two alternatives, (?<=[^\w./-]) and (?<=^). The second one allows a match when the word starts at the beginning of the line. We cannot use (?<=[^\w./-]|^) because a lookbehind in Python's re must have a fixed width (and [^\w./-] has a width of 1 while ^ has a width of 0).
(?<=[^\w./-]) allows a match only if the character immediately before the word is not a word character, period, forward slash or hyphen.
When broken down, the small parts are rather straightforward I think, but if you would like more elaboration on anything, I can give more details.
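The proposed regex can be checked against a few of the samples from the question with a short script. This is only a sketch; the sample strings are copied or shortened from the question and the variable names are illustrative.

```python
import re

find_words = re.compile(
    r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)'
).findall

wanted = find_words("Company, months. third-party")
print(wanted)

# Tokens that should be rejected entirely.
rejected = [find_words(s) for s in ("-Publisher", "NY7xtb92dCTfvEjdmkDrUw==")]
print(rejected)

# A word after a period and a space is still picked up.
print(find_words(". These"))
```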
This is not purely a regex task, because a regex alone cannot tell whether a token is a real word. You need a dictionary to check your words against.
So I suggest using a regex to split your string on non-alphabetic characters and then checking that all of the items exist in your dictionary. For example:
import re
# split on runs of non-alphabetic characters; empty strings are skipped below
words = re.split(r'[^a-zA-Z]+', my_string)
print(all(i in my_dict for i in words if i))
As an alternative, you can use nltk.corpus as your dictionary:
from nltk.corpus import wordnet
words = re.split(r'[^a-zA-Z]+', my_string)
if all(wordnet.synsets(i) for i in words if i):
    pass  # do stuff
But if you want to use your own word lists, you need to change your regex because it is incorrect. Instead, use re.split as above:
all_words = wanted1 | wanted2 | negators
with open("test.txt", "r") as f:
    for line in f:
        for word in line.split():
            words = re.split(r'[^a-zA-Z]+', word)
            if all(i in all_words for i in words if i):
                print(word)
Instead of using all sorts of complicated look-arounds, you can use \b to detect the boundary of words. This way, you can use e.g. \b[a-zA-Z]+(?:-[a-zA-Z]+)*\b
Example:
>>> p = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b"
>>> text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
>>> re.findall(p, text)
['This', 'is', 'some', 'example', 'text', 'with', 'some', 'multi-hyphen-words', 'and', 'words', 'in', 'it']
Update: Seems like this does not work too well, as it also detects fragments from URLs, e.g. www, sec and gov from http://www.sec.gov.
Instead, you might try this variant, using look-around explicitly stating the 'legal' characters:
r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
This seems to pass all your test-cases.
Let's dissect this regex:
(?<![^\s("]) - look-behind asserting that the word is preceded by a space, quote or parenthesis, but e.g. not a number (using double negation instead of a positive look-behind so the first word is matched, too)
[a-zA-Z]+ - the first part of the word
(?:[-'][a-zA-Z]+)* - optionally more word-segments after a ' or -
(?=[\s.,:;!?")]) - look-ahead asserting that the word is followed by space, punctuation, quote or parens
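The dissected pattern can be tried on a short sample as below. This is a minimal sketch; the sample text is made up from tokens in the question and the names `P` and `text` are illustrative.

```python
import re

P = r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""

text = "On-line maintenance, other. (Google) invalid42 -Publisher "
print(re.findall(P, text))

# The look-ahead needs an actual following character, so a word at the very
# end of the string is only matched if something (e.g. a newline) follows it.
print(re.findall(P, "Google"))
print(re.findall(P, "Google\n"))
```

When reading a file line by line, each line normally still ends with its newline, so the last word on a line is matched; for strings without a trailing character, appending (?=...|$)-style handling would be needed.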