I have a corpus of text documents, some of which will have a sequence of substrings. The first and last substrings are consistent, and mark the beginning and the end of the parts I want to replace. But, I would also like to delete/replace all substrings that exist between these first and last positions.
origSent = 'This is the sentence I am intending to edit'
Using the above as an example, how would I go about using 'the' as the start substring, and 'intending' as the end substring, deleting both in addition to the words that exist between them to make the following:
newSent = 'This is to edit'
You could use regex replacement here:
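import re
origSent = 'This is the sentence I am intending to edit'
# the trailing \s* also consumes the whitespace after the end word, so no double space is left behind
newSent = re.sub(r'\bthe\b((?!\bthe\b).)*\bintending\b\s*', '', origSent)
print(newSent)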
This prints:
This is to edit
The "secret sauce" in the regex pattern is the tempered dot:
((?!\bthe\b).)*
This will consume all content which does not cross over another occurrence of the word "the". This prevents the match from starting at some earlier "the" before "intending", which we don't want.
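To see what the tempered dot buys you, here is a minimal sketch that compares it with a plain lazy dot on a sentence containing two occurrences of the start word. The helper name remove_span and the sample sentence are illustrative assumptions, not part of the answer above, and the trailing \s* is only there to avoid leaving a double space.

```python
import re

def remove_span(text, start, end):
    """Delete start...end, beginning at the last `start` word before `end`."""
    start_re, end_re = re.escape(start), re.escape(end)
    # Tempered dot: consume any character that does not begin another `start` word.
    pattern = rf'\b{start_re}\b(?:(?!\b{start_re}\b).)*?\b{end_re}\b\s*'
    return re.sub(pattern, '', text)

sample = 'the cat saw the dog intending to run'
tempered = remove_span(sample, 'the', 'intending')
# Plain lazy dot for comparison: the match starts at the first "the".
plain = re.sub(r'\bthe\b.*?\bintending\b\s*', '', sample)
print(tempered)  # the cat saw to run
print(plain)     # to run
```

Whether you want the match to start at the first or the last "the" depends on your data, so pick the variant that fits.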
I would do this:
s_list = origSent.split()
newSent = ' '.join(s_list[:s_list.index('the')] + s_list[s_list.index('intending')+1:])
Hope this helps.
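One caveat with the split/index approach is that list.index raises ValueError when a word is missing, which will happen for documents that lack the markers. A small sketch that guards against this follows; the function name drop_between is made up for illustration.

```python
def drop_between(sentence, start, end):
    """Remove `start`, `end`, and every word between them.

    Returns the sentence unchanged if either marker is missing.
    """
    words = sentence.split()
    try:
        i = words.index(start)
        j = words.index(end, i + 1)  # look for `end` only after `start`
    except ValueError:
        return sentence
    return ' '.join(words[:i] + words[j + 1:])

print(drop_between('This is the sentence I am intending to edit', 'the', 'intending'))
# This is to edit
```

Note that this compares whole whitespace-separated tokens, so a marker followed by punctuation (for example "the,") would not be found.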
Related
Let's say I want to remove the word "tree" in every string in a Pandas dataframe column.
I would specify the substring(s) I want removed in a list. And then use replace and join on the column, as per below:
remove_list = ['\tree\s']
df['column'] = df['column'].str.replace('|'.join(remove_list ), '', regex=True).str.strip()
The reason I add a \s to tree is because there may be words like treehouse or backstreet. So I want to replace the word only if it ends with a space, so that I don't end up with words like "house" or "backst".
However I noticed that when I run this code, it misses "tree"s that are at the end of the string, because there is no space after it. Hence, it doesn't get removed. Any idea on how I can account for those?
Actually, I think the logic you want here is:
remove_list = ['tree']
terms = r'\s*\b(?:' + '|'.join(remove_list) + r')\b\s*'
df['column'] = df['column'].str.replace(terms, ' ', regex=True).str.strip()
Note that the regex pattern used above is, for a one word term list, \s*\b(?:tree)\b\s*. This will match only the exact word tree and not when tree appears as a substring of another word. We also attempt to grab any spaces on either side of the word. Then, we replace with just a single space, and trim the column to make sure there are no stray spaces at the start or end.
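If you want to check the pattern without a dataframe, the same logic can be tried on plain strings with re.sub. This is only a sketch: the sample strings are invented, and re.escape is an extra precaution (not in the answer above) in case a term contains regex metacharacters.

```python
import re

remove_list = ['tree']
# Build \s*\b(?:tree)\b\s* from the term list.
terms = r'\s*\b(?:' + '|'.join(map(re.escape, remove_list)) + r')\b\s*'

samples = ['oak tree', 'tree on a hill', 'treehouse in the backstreet', 'a tree here']
cleaned = [re.sub(terms, ' ', s).strip() for s in samples]
print(cleaned)
# ['oak', 'on a hill', 'treehouse in the backstreet', 'a here']
```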
Edit:
To address the edge case put forth by @user2357112, consider the following input:
apple tree tree squirrel
In this case, the above solution would leave behind two spaces in between apple and squirrel. We can get around this by expanding our regex pattern to allow for multiple consecutive keyword matches:
terms = r'\s*\b(?:' + '|'.join(remove_list) + r')\b(?: \b(?:' + '|'.join(remove_list) + r'))*\b\s*'
df['column'] = df['column'].str.replace(terms, ' ', regex=True).str.strip()
Here we are using the following regex pattern:
\s*\b(?:tree)\b(?: \b(?:tree))*\b\s*
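A possibly simpler way to express "one or more keywords in a row" is to put a + quantifier around the keyword group. This is an alternative sketch, not the pattern from the answer, shown on plain strings with re.sub.

```python
import re

remove_list = ['tree']
alternation = '|'.join(map(re.escape, remove_list))
# One or more keywords in a row, each optionally preceded by whitespace.
terms = r'(?:\s*\b(?:' + alternation + r')\b)+\s*'

result = re.sub(terms, ' ', 'apple tree tree squirrel').strip()
print(result)  # apple squirrel
```

Because each repetition allows \s* before the keyword, this variant also copes with several spaces between consecutive keywords.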
Basically, I start with inserting the word "brand" where I replace a single character in the word with an underscore and try and find all words that match the remaining characters. For example:
"b_and" would return: "band", "brand", "bland" .... etc.
I started by using re.sub to substitute the underscore for the character, but I'm really lost on where to go next. I only want words that differ at the underscore, either by dropping the underscore or by replacing it with a letter. For example, if the word "under" were run through the list, I wouldn't want it to return "understood" or "thunder", just words with a single character difference. Any ideas would be great!
I tried replacing the character with every letter in the alphabet first and then checking whether that word is in the dictionary, but that took such a long time. I really want to know if there's a faster way.
from itertools import chain
dictionary = open("Scrabble.txt").read().split('\n')
import re, string
# (fragment from inside a function that receives `word`)
# after replacing the word with "_", we find words in the dictionary that match the pattern
new = []
for letter in string.ascii_lowercase:
    underscore = re.sub('_', letter, word)
    if underscore in dictionary:
        new.append(underscore)
if new == []:
    pass
else:
    return new
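Part of the slowness in the code above likely comes from `underscore in dictionary` scanning a list each time. A minimal sketch of the same brute-force idea using a set for lookups follows; the function name and the tiny word set are assumptions standing in for the Scrabble.txt contents.

```python
import string

def one_letter_variants(pattern, words):
    """Return words made by deleting '_' or replacing it with one letter.

    `words` should be a set so that membership tests are fast.
    """
    candidates = {pattern.replace('_', '', 1)}
    candidates.update(
        pattern.replace('_', letter, 1) for letter in string.ascii_lowercase
    )
    return sorted(candidates & words)

words = {'band', 'brand', 'bland', 'bandit', 'understood'}
print(one_letter_variants('b_and', words))  # ['band', 'bland', 'brand']
```

With the real word list you would build the set once, for example from the lines of the file, and reuse it for every pattern.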
IIUC this should do it. I'm doing it outside a function so you have a working example, but it's straightforward to do it inside a function.
import re
string = 'band brand bland cat dand bant bramd branding blandisher'
word = 'brand'
new = []
for n, letter in enumerate(word):
    pattern = word[:n] + r'\w?' + word[n+1:]
    new.extend(re.findall(pattern, string))
new = list(set(new))
Output:
['bland', 'brand', 'bramd', 'band']
Explanation:
We're using regex to do what you're looking for. In every iteration we take one letter out of "brand" and make the algorithm look for any word that matches. So it'll look for:
_rand, b_and, br_nd, bra_d, bran_
For the case of "b_and" the pattern is b\w?and, which means: find a word with b, then any character may or may not appear, and then 'and'.
Then it adds to the list all words that match.
Finally I remove duplicates with list(set(new))
Edit: forgot to add the string variable.
Here's a version of Juan C's answer that's a bit more Pythonic
import re
dictionary = open("Scrabble.txt").read().split('\n')
pattern = "b_and" # change to what you need
pattern = pattern.replace('_', '.?')
pattern += '\\b'
matching_words = [word for word in dictionary if re.match(pattern, word)]
Edit: fixed the regex according to your comment, quick explanation:
pattern = "b_and"
pattern = pattern.replace('_', '.?') # pattern is now b.?and, .? matches any one character (or none at all)
pattern += '\\b' # \b prevents matching with words like "bandit" or words longer than "b_and"
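As a variation on the same idea, re.fullmatch can replace the trailing \b, since it only succeeds when the whole word matches. The sketch below is an assumption-laden illustration: the function name and the short word list are invented, and [a-z]? is used in place of .? to restrict the wildcard to lowercase letters.

```python
import re

def match_wildcard(pattern, words):
    """Return words equal to `pattern` with '_' dropped or replaced by one letter."""
    # Escape the literal pieces and join them with an optional single letter.
    regex = re.compile('[a-z]?'.join(re.escape(part) for part in pattern.split('_')))
    return [w for w in words if regex.fullmatch(w)]

words = ['band', 'brand', 'bland', 'bandit', 'husband']
print(match_wildcard('b_and', words))  # ['band', 'brand', 'bland']
```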
I'm looking to find words in a string that match a specific pattern.
Problem is, if the words are part of an email address, they should be ignored.
To simplify, the pattern of the "proper words" is \w+\.\w+ - one or more word characters, an actual period, and another series of word characters.
The sentence that causes problem, for example, is a.a b.b:c.c d.d#e.e.e.
The goal is to match only [a.a, b.b, c.c] . With most Regexes I build, e.e returns as well (because I use some word boundary match).
For example:
>>> re.findall(r"(?:^|\s|\W)(?<!#)(\w+\.\w+)(?!#)\b", "a.a b.b:c.c d.d#e.e.e")
['a.a', 'b.b', 'c.c', 'e.e']
How can I match only among words that do not contain "#"?
I would definitely clean it up first and simplify the regex.
First, split the input on the separators:
words = re.split(r':|\s', "a.a b.b:c.c d.d#e.e.e")
Then filter out the words that have a # in them:
words = [word for word in words if re.search(r'^((?!#).)*$', word)]
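Putting the two steps together, a minimal sketch could look like the following. The final re.fullmatch check against \w+\.\w+ is an addition not spelled out in the answer above; it makes sure each surviving token really has the "proper word" shape.

```python
import re

text = "a.a b.b:c.c d.d#e.e.e"
# Split on runs of colons and whitespace.
tokens = re.split(r'[:\s]+', text)
# Keep tokens without '#' that consist of exactly word.word.
proper = [t for t in tokens if '#' not in t and re.fullmatch(r'\w+\.\w+', t)]
print(proper)  # ['a.a', 'b.b', 'c.c']
```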
Properly parsing email addresses with a regex is extremely hard, but for your simplified case, with a simple definition of word ~ \w+\.\w+ and email ~ any sequence that contains #, this regex might do what you need:
>>> re.findall(r"(?:^|[:\s]+)(\w+\.\w+)(?=[:\s]+|$)", "a.a b.b:c.c d.d#e.e.e")
['a.a', 'b.b', 'c.c']
The trick here is not to focus on what comes in the next or previous word, but on what the word currently captured has to look like.
Another trick is in properly defining word separators. Before the word we'll allow multiple whitespaces, : and string start, consuming those characters, but not capturing them. After the word we require almost the same (except string end, instead of start), but we do not consume those characters - we use a lookahead assertion.
You may match the email-like substrings with \S+#\S+\.\S+ and match and capture your pattern with (\w+\.\w+) in all other contexts. Use re.findall to only return captured values and filter out empty items (they will be in re.findall results when there is an email match):
import re
rx = r"\S+#\S+\.\S+|(\w+\.\w+)"
s = "a.a b.b:c.c d.d#e.e.e"
res = list(filter(None, re.findall(rx, s)))  # list() is needed on Python 3, where filter returns an iterator
print(res)
# => ['a.a', 'b.b', 'c.c']
I am using the following regex in Python to keep words that do not contain non alphabetical characters:
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
The problem is that this regex does not keep words that I would like to keep such as the following:
Company,
months.
third-party
In other words I would like to keep words that are followed by a comma, a dot, or have a dash between two words.
Any ideas on how to implement this?
I tried adding something like |(?<!\S)[A-Za-z]+(?=\.(?!\S)) for the dots but it does not seem to be working.
Thanks !
EDIT:
Should match these:
On-line
. These
maintenance,
other.
. Our
Google
Should NOT match these:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
NY7xtb92dCTfvEjdmkDrUw==
$As_Of_12_31_20104206http://www.sec.gov/CIK0001393311instant2010-12-31T00:00:000001-01-01T00:00:00falsefalseArlington/S.Cooper
-Publisher
gaap_RealEstateAndAccumulatedDepreciationCostsCapitalizedSubsequentToAcquisitionCarryingCostsus
At the moment I am using the following python code to read a text file line by line:
find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))').findall
then I open the text file
contents = open("test.txt","r")
and I search for the words line by line
for line in contents:
    if find_words(line.lower()) != []:
        lineWords = find_words(line.lower())
        print("The words in this line are: ", lineWords)
using some word lists in the following way:
wanted1 = set(find_words(open('word_list_1.csv').read().lower()))
wanted2 = set(find_words(open('word_list_2.csv').read().lower()))
negators = set(find_words(open('word_list_3.csv').read().lower()))
I first want to get the valid words from the .txt file, and then check whether these words belong to the word lists. The two steps are independent.
I propose this regex:
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall
There are 3 parts from your initial regex that I changed:
Middle part:
[A-Za-z]+(?:-[A-Za-z]+)*
This allows hyphenated words.
The last part:
(?=\W|$)
This is a bit similar to (?!\S), except that it also allows non-space characters such as punctuation. A match is allowed if, after the matched word, the line ends or there is a non-word character, in other words anything that is not a letter, a digit, or an underscore. (As written, the word in word_ is not matched, because _ counts as a word character; if you do want it to match, change \W to [^a-zA-Z0-9].)
The first part (probably most complex):
(?:(?<=[^\w./-])|(?<=^))
It is composed of two parts itself which matches either (?<=[^\w./-]) or (?<=^). The second one allows a match if the line begins before the word to be matched. We cannot use (?<=[^\w./-]|^) because python's lookbehind from re cannot be of variable width (with [^\w./-] having a length of 1 and ^ a length of 0).
(?<=[^\w./-]) allows a match if, before the word, there are no word characters, periods, forward slashes or hyphens.
When broken down, the small parts are rather straightforward I think, but if there's anything you want some more elaboration, I can give more details.
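A quick way to convince yourself is to run the compiled pattern over a few of the tokens from the question. This is just a sanity-check sketch; only a subset of the listed test cases is included.

```python
import re

find_words = re.compile(
    r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)'
).findall

should_match = ['On-line', 'maintenance,', 'other.', 'Google']
should_not_match = ['-Publisher', 'NY7xtb92dCTfvEjdmkDrUw==']

# Tokens in the first list yield one word each; the second list yields nothing.
for token in should_match + should_not_match:
    print(repr(token), find_words(token))
```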
This is not purely a regex task, because a regex alone cannot tell whether a token is a real word. You need a dictionary to check your words against.
So I suggest using a regex to split your string on non-alphabetical characters and then checking whether all of the items exist in your dictionary. For example:
import re
words = re.split(r'[^A-Za-z]+', my_string)
print(all(i in my_dict for i in words if i))
As an alternative, you can use nltk.corpus as your dictionary:
from nltk.corpus import wordnet
words = re.split(r'[^A-Za-z]+', my_string)
if all(wordnet.synsets(i) for i in words if i):
    pass  # do stuff
But if you want to use your own word list, you need to change your regex, because it is incorrect. Instead, use re.split as above:
all_words = wanted1 | wanted2 | negators
with open("test.txt", "r") as f:
    for line in f:
        for word in line.split():
            words = re.split(r'[^A-Za-z]+', word)
            if all(i in all_words for i in words if i):
                print(word)
Instead of using all sorts of complicated look-arounds, you can use \b to detect the boundary of words. This way, you can use e.g. \b[a-zA-Z]+(?:-[a-zA-Z]+)*\b
Example:
>>> p = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b"
>>> text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
>>> re.findall(p, text)
['This', 'is', 'some', 'example', 'text', 'with', 'some', 'multi-hyphen-words', 'and', 'words', 'in', 'it']
Update: Seems like this does not work too well, as it also detects fragments from URLs, e.g. www, sec and gov from http://www.sec.gov.
Instead, you might try this variant, using look-around explicitly stating the 'legal' characters:
r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
This seems to pass all your test-cases.
Let's dissect this regex:
(?<![^\s("]) - look-behind asserting that the word is preceded by a space, quote or paren, but e.g. not a number (using double negation instead of a positive look-behind so the first word is matched, too)
[a-zA-Z]+ - the first part of the word
(?:[-'][a-zA-Z]+)* - optionally more word-segments after a ' or -
(?=[\s.,:;!?")]) - look-ahead asserting that the word is followed by space, punctuation, quote or parens
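Here is a small sketch exercising the pattern on an invented sentence that mixes an apostrophe, a hyphenated word, parentheses, and a token containing digits. The sentence is an assumption for illustration only.

```python
import re

pattern = re.compile(r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])""")

text = "Don't panic: third-party tools (mostly) work, but invalid42 tokens fail."
found = pattern.findall(text)
print(found)
# ["Don't", 'panic', 'third-party', 'tools', 'mostly', 'work', 'but', 'tokens', 'fail']
```

Note that the look-ahead requires a following character, so a final word with nothing after it (no punctuation or newline) would not be matched.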
I have strings that are of the form below:
<p>The is a string.</p>
<em>This is another string.</em>
They are read in from a text file one line at a time. I want to separate these into words. For that I am just splitting the string using split().
Now I have a set of words but the first word will be <p>The rather than The. Same for the other words that have <> next to them. I want to remove the <..> from the words.
I'd like to do this in one line. What I mean is I want to pass as a parameter something of the form <*>, like I would on the command line. I was thinking of using the replace() function to do this, but I am not sure what the replace() function parameter would look like.
For example, how could I change <..> below in a way that it will mean that I want to include anything that is between < and >:
x = x.replace("<..>", "")
Unfortunately, str.replace does not support Regex patterns. You need to use re.sub for this:
>>> from re import sub
>>> sub("<[^>]*>", "", "<p>The is a string.</p>")
'The is a string.'
>>> sub("<[^>]*>", "", "<em>This is another string.</em>")
'This is another string.'
>>>
[^>]* matches zero or more characters that are not >.
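Since the goal in the question is a list of words, the substitution can be chained with split(). A minimal sketch, with a made-up helper name:

```python
import re

TAG = re.compile(r'<[^>]*>')

def words_without_tags(line):
    """Strip anything that looks like <...> and split the rest on whitespace."""
    return TAG.sub('', line).split()

print(words_without_tags('<p>The is a string.</p>'))
# ['The', 'is', 'a', 'string.']
```

Punctuation stays attached to the words here (note 'string.'), so strip it separately if you need bare words.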
No Need for a 2-Step Solution
You don't need to 1. Split then 2. Replace. The two solutions below show you how to do it with one single step.
Option 1: Match All Instead of Splitting
Match All and Split are Two Sides of the Same Coin, and in this case it is safer to match all:
<[^>]+>|(\w+)
The words will be in Group 1.
Use it like this:
import re
subject = '<p>The is a string.</p><em>This is another string.</em>'
regex = re.compile(r'<[^>]+>|(\w+)')
matches = [group for group in re.findall(regex, subject) if group]
print(matches)
Output
['The', 'is', 'a', 'string', 'This', 'is', 'another', 'string']
Discussion
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
Option 2: One Single Split
<[^>]+>|[ .]
On the left side of the |, we use <complete tags> as a split delimiter. On the right side, we use a space character or a period.
Output
This
is
a
string
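No code is shown for this option, so here is a sketch of how the split might be written in Python with re.split. The input line is taken from the question; filtering out empty strings is an added step, needed because adjacent delimiters (such as the period followed by a closing tag) produce empty pieces.

```python
import re

line = '<p>The is a string.</p>'
# Split on complete tags, spaces, and periods.
pieces = re.split(r'<[^>]+>|[ .]', line)
words = [p for p in pieces if p]
print(pieces)  # ['', 'The', 'is', 'a', 'string', '', '']
print(words)   # ['The', 'is', 'a', 'string']
```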