Regex to identify a fixed-length alphanumeric word in text in Python

I have a text file from which I am trying to remove alphanumeric words of seven characters.
Text1: " I have to remove the following word, **WORD123**, from the text given"
Text2: " I have to remove the following word, **WORD001**, the text given"
So far I tried '\b[A-Za-z0-9]\b', but it doesn't work (without a quantifier it only matches single characters; '\b[A-Za-z0-9]{7}\b' is needed for seven).
Also, can we add a functionality so that it picks only those words which are followed by "from" (not the actual word, just an example)? In the above examples it should only pick WORD123, and not WORD001, as the latter is not followed by "from".

You may use re.sub here, e.g.
import re

inp = "I have to remove the following word, WORD123, FROM the text given"
out = re.sub(r'\s*\b[A-Za-z0-9]{7}\b[^\w]*(?=\bfrom)', '', inp, flags=re.IGNORECASE)
print(out)
This prints:
I have to remove the following word,FROM the text given
Note that the above regex replacement does not match/affect the second sample input sentence you gave, since its seven-character word is not followed by the keyword from.
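Wrapped up as a small runnable sketch (the keyword parameter is an assumption here, since the question says "from" is just an example):

```python
import re

def strip_word_before(text, keyword="from"):
    # Remove a 7-char alphanumeric word (plus surrounding space/punctuation)
    # only when the next word is the given keyword, case-insensitively.
    pattern = r'\s*\b[A-Za-z0-9]{7}\b[^\w]*(?=\b' + re.escape(keyword) + r')'
    return re.sub(pattern, '', text, flags=re.IGNORECASE)

print(strip_word_before("I have to remove the following word, WORD123, FROM the text given"))
# the second sample is left untouched, since WORD001 is not followed by "from"
print(strip_word_before("I have to remove the following word, WORD001, the text given"))
```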

Related

How do I find a character in a text and then copy the word that contains that character in Python

I want to extract a word out of a string based on which characters it contains. For example:
string: I WANT TO EAT cheese in zeven11
Extract all words with 11 in it
extracted: Zeven11
I tried the find() method, but that only gives me the index of a single character of the word.
Maybe something like this:
for word in string.split():
    if '11' in word:
        print(word.capitalize())  # first letter?
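If a regex is preferred, the same extraction can be sketched like this (assuming, as above, that a "word" is any whitespace-separated token):

```python
import re

string = "I WANT TO EAT cheese in zeven11"

# \S*11\S* matches any whitespace-delimited token that contains "11"
matches = [m.capitalize() for m in re.findall(r'\S*11\S*', string)]
print(matches)
```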

Regex: catch one character but not more

I'm trying to find a regexp that catches all instances that contain one and only one \n and any number of (space), in a string that might also contain instances with multiple \n. So, for instance (with spaces denoted with _):
Should be caught:
\n
_\n
\n_
_\n_
Should *not* be caught, not even the first \n:
_
___
\n\n\n\n
\n\n\n_\n\n
_\n\n
\n\n_
_\n\n_
_\n\n_\n
\n_\n_
_\n_\n
_\n\n_\n_
___\n__\n and so on...
(Using re in Python 3 on Windows 10)
Edit to clarify the context: I'm parsing the text of a web page and I have a block of text in a string, that looks like that:
Word word word. Word word word word word. \n Word word word word word word. Word word word word. \n\n \nWord word word word word. \nWord word word. Word word word.
In the subsequent steps of my code, I'm using a function that gets rid of any \n, so I want to detect where they are before using this function, so I can keep them (by replacing them temporarily with special characters that won't disappear). But as you can see, I have two cases :
1) Multiple \n indicate a break of paragraphs, but I have no way to be sure that they follow each other without spaces or tabs between them. I want to catch them to replace them with a special character (like § for instance) that will let me know later where to put back multiple \n. It only matters that I know there are 2 or more \n, not how many there are. At the moment, I'm using this (but please do tell me if there is a bug):
text = re.sub(r"[ \t]*(?:\n[ \t]*){2,}", "$", text)
2) Single \n indicate a line break within a paragraph. These are what I want to single out, without catching the instances of the previous case. Again, it's to replace them with a special character (say |) to put it back later:
text = re.sub(r" the_regex_I'm_looking_for ", "|", text)
(I know I could do the first replacement, and then search for the remaining \n, but for reasons that would be largely irrelevant here and long to explain, I can't.)
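A quick check of the step-1 substitution above (using a plain $ marker, without the surrounding spaces):

```python
import re

text = "Word one. \n\n \nWord two. \nWord three."
# runs of two or more \n (possibly with spaces/tabs between) become a single $;
# the lone \n before "Word three" is left alone
out = re.sub(r"[ \t]*(?:\n[ \t]*){2,}", "$", text)
print(out)
```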
2nd edit: So, for instance, the desired result in this case would be:
Word word word. Word word word word word. | Word word word word word word. Word word word word. $ Word word word word word. | Word word word. Word word word.
(I'd rather have no spaces before and after the § and the |, but here I'm forced to put them for the bold formatting of StackOverflow, if I don't I get something like **$**that.)
Would the following pattern suit you?
import regex as re
StrVal = r'Word word word. Word word word word word. \n Word word word word word word. Word word word word. \n\n \nWord word word word word. \nWord word word. Word word word.'
StrVal = re.sub(r'(?<!\\n\s*)\s*\\n\s*(?!\s*\\n)', '|', StrVal)
print(StrVal)
Returns:
Word word word. Word word word word word.|Word word word word word word. Word word word word. \n\n \nWord word word word word.|Word word word. Word word word.
So instead of the re module I used the regex module, to make use of a non-fixed-width quantifier in the negative lookbehind, something re would not allow. That way, patterns like \n \n\n \n also get no substitution.
Check whether this demo is OK for you. I have used the space character instead of "_".
import re
pattern = '^ *\n *$'
test_string = "\n\n "
result = re.findall(pattern, test_string)
print(result)
NB: I first tried '^\s*\n\s*', but it does not work, since \s is equivalent to [ \t\n\r\f\v] and therefore also matches \n itself; so I used the space character ' ' instead.
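With plain re, another way around the fixed-width lookbehind limitation is to match every whitespace run and decide in a replacement function (a sketch, using | and $ as in the question):

```python
import re

def mark_breaks(text):
    # classify each whitespace run by how many \n it contains:
    # 0 -> keep as-is, 1 -> "|" (line break), 2 or more -> "$" (paragraph break)
    def repl(m):
        n = m.group().count("\n")
        return m.group() if n == 0 else "|" if n == 1 else "$"
    return re.sub(r"\s+", repl, text)

print(mark_breaks("Word word. \n Word word. \n\n \nWord. \nWord."))
```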

Regex: how to remove only a word that includes some particular characters

I'm looking for regex to get the result below.
The original sentence is:
txt="そう言え"
txt="そう言う"
and expected result is:
output="そう"
output="そう"
What I want to do here is to remove a two-character word that includes the character "言".
I tried output = re.sub(r"^(?=.*言).*$", "", txt) in Python, but it actually removes the entire sentence. What do I do?
You can use a pattern that matches 言 followed by another word (denoted by \w), so that re.sub can replace the match with an empty string:
re.sub(r"言\w", "", txt)
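For instance:

```python
import re

# \w also matches Japanese word characters in Python 3's Unicode-aware re
for txt in ("そう言え", "そう言う"):
    print(re.sub(r"言\w", "", txt))  # both print そう
```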

Python regex keep a few more tokens

I am using the following regex in Python to keep words that do not contain non alphabetical characters:
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
The problem is that this regex does not keep words that I would like to keep such as the following:
Company,
months.
third-party
In other words I would like to keep words that are followed by a comma, a dot, or have a dash between two words.
Any ideas on how to implement this?
I tried adding something like |(?<!\S)[A-Za-z]+(?=\.(?!\S)) for the dots but it does not seem to be working.
Thanks!
EDIT:
Should match these:
On-line
. These
maintenance,
other.
. Our
Google
Should NOT match these:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
NY7xtb92dCTfvEjdmkDrUw==
$As_Of_12_31_20104206http://www.sec.gov/CIK0001393311instant2010-12-31T00:00:000001-01-01T00:00:00falsefalseArlington/S.Cooper
-Publisher
gaap_RealEstateAndAccumulatedDepreciationCostsCapitalizedSubsequentToAcquisitionCarryingCostsus
At the moment I am using the following Python code to read a text file line by line:
find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))').findall
then I open the text file:
contents = open("test.txt","r")
and I search for the words line by line:
for line in contents:
    if find_words(line.lower()) != []:
        lineWords = find_words(line.lower())
        print "The words in this line are: ", lineWords
using some word lists in the following way:
wanted1 = set(find_words(open('word_list_1.csv').read().lower()))
wanted2 = set(find_words(open('word_list_2.csv').read().lower()))
negators = set(find_words(open('word_list_3.csv').read().lower()))
I first want to get the valid words from the .txt file, and then check whether these words belong to the word lists. The two steps are independent.
I propose this regex:
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall
There are 3 parts from your initial regex that I changed:
Middle part:
[A-Za-z]+(?:-[A-Za-z]+)*
This allows hyphenated words.
The last part:
(?=\W|$)
This is a bit similar to (?!\S), except that it also allows non-space characters such as punctuation after the word. So a match is allowed if, after the matched word, the line ends or there is a non-word character, i.e. no letter, digit or underscore (note that with \W, word inside word_ will not match, since _ counts as a word character).
The first part (probably most complex):
(?:(?<=[^\w./-])|(?<=^))
It is composed of two parts itself which matches either (?<=[^\w./-]) or (?<=^). The second one allows a match if the line begins before the word to be matched. We cannot use (?<=[^\w./-]|^) because python's lookbehind from re cannot be of variable width (with [^\w./-] having a length of 1 and ^ a length of 0).
(?<=[^\w./-]) allows a match if, before the word, there are no word characters, periods, forward slashes or hyphens.
When broken down, the small parts are rather straightforward I think, but if there's anything you want some more elaboration, I can give more details.
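A quick sanity check of the proposed pattern against samples from the question:

```python
import re

find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall

print(find_words("maintenance, other. On-line"))  # wanted samples match
print(find_words("-Publisher"))                   # unwanted sample does not
```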
This is not a regex task, because you cannot detect real words with a regex alone. You must have a dictionary to check your words against.
So I suggest using a regex to split your string on non-alphabetical characters and checking whether all of the items exist in your dictionary. For example:
import re

words = re.split(r'[^A-Za-z]+', my_string)
print all(i in my_dict for i in words if i)
As an alternative, you can use nltk.corpus as your dictionary:
from nltk.corpus import wordnet

words = re.split(r'[^A-Za-z]+', my_string)
if all(wordnet.synsets(i) for i in words if i):
    # do stuff
But if you want to use your own word list, you need to change your regex, because it is incorrect as written; instead use re.split as above:
all_words = wanted1 | wanted2 | negators
with open("test.txt", "r") as f:
    for line in f:
        for word in line.split():
            words = re.split(r'[^A-Za-z]+', word)
            if all(i in all_words for i in words if i):
                print word
Instead of using all sorts of complicated look-arounds, you can use \b to detect the boundary of words. This way, you can use e.g. \b[a-zA-Z]+(?:-[a-zA-Z]+)*\b
Example:
>>> p = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b"
>>> text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
>>> re.findall(p, text)
['This', 'is', 'some', 'example', 'text', 'with', 'some', 'multi-hyphen-words', 'and', 'words', 'in', 'it']
Update: Seems like this does not work too well, as it also detects fragments from URLs, e.g. www, sec and gov from http://www.sec.gov.
Instead, you might try this variant, using look-around explicitly stating the 'legal' characters:
r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
This seems to pass all your test-cases.
Let's dissect this regex:
(?<![^\s("]) - look-behind asserting that the word is preceded by a space, quote or parenthesis, but e.g. not a number (using double negation instead of a positive look-behind so that the first word of the string is matched, too)
[a-zA-Z]+ - the first part of the word
(?:[-'][a-zA-Z]+)* - optionally more word-segments after a ' or -
(?=[\s.,:;!?")]) - look-ahead asserting that the word is followed by space, punctuation, quote or parens

extract English words from string in python

I have a document that each line is a string. It might contain digits, non-English letters and words, symbols(such as ! and *). I want to extract the English words from each line(English words are separated by space).
My code is the following; it is the map function of my map-reduce job. However, based on the final result, this mapper function only produces frequency counts of single letters (such as a, b, c). Can anyone help me find the bug? Thanks
import sys
import re

for line in sys.stdin:
    line = re.sub("[^A-Za-z]", "", line.strip())
    line = line.lower()
    words = ' '.join(line.split())
    for word in words:
        print '%s\t%s' % (word, 1)
You've actually got two problems.
First, this:
line = re.sub("[^A-Za-z]", "", line.strip())
This removes all non-letters from the line. Which means you no longer have any spaces to split on, and therefore no way to separate it into words.
Next, even if you didn't do that, you do this:
words = ' '.join(line.split())
This doesn't give you a list of words, this gives you a single string, with all those words concatenated back together. (Basically, the original line with all runs of whitespace converted into a single space.)
So, in the next line, when you do this:
for word in words:
You're iterating over a string, which means each word is a single character. Because that's what strings are: iterables of characters.
If you want each word (as your variable names imply), you already had those; the problem is that you joined them back into a string. Just skip the join:
words = line.split()
for word in words:
Or, if you want to strip out things besides letters and whitespace, use a regular expression that strips out everything besides letters and whitespace, not one that strips out everything besides letters, including whitespace:
line = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = line.split()
for word in words:
However, that pattern is still probably not what you want. Do you really want to turn 'abc1def' into the single string 'abcdef', or into the two strings 'abc' and 'def'? You probably want either this:
line = re.sub(r"[^A-Za-z]", " ", line.strip())
words = line.split()
for word in words:
… or just:
words = re.split(r"[^A-Za-z]", line.strip())
for word in words:
There are two issues here:
line = re.sub("[^A-Za-z]", "", line.strip()) will remove all the non-letter characters, including spaces, making it hard to split into words in the subsequent stage. One alternative solution is words = re.findall('[A-Za-z]+', line)
As mentioned by @abarnert, in the existing code words is a string, so for word in words will iterate over each letter. To get words as a list of words, you can follow point 1.
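For reference, with the + quantifier findall returns whole words rather than single letters:

```python
import re

line = "123 hello *world* test!"
# [A-Za-z]+ (note the +) extracts whole runs of letters in one pass
words = re.findall(r'[A-Za-z]+', line)
print(words)
```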
