regex how to remove only a word including some particular letters - python

I'm looking for regex to get the result below.
The original sentence is:
txt="そう言え"
txt="そう言う"
and expected result is:
output="そう"
output="そう"
What I want to do here is remove a two-character word that includes the character "言".
I tried output = re.sub(r"^(?=.*言).*$", "", txt) in Python, but it removes the entire sentence. What should I do?

You can use a pattern that matches 言 followed by one more word character (denoted by \w), so that re.sub can replace the match with an empty string:
re.sub(r"言\w", "", txt)
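As a quick check of the pattern above, here is a minimal sketch that applies it to both sample strings. The helper name remove_word_with_kanji is made up for illustration, and the sketch assumes Python 3, where \w matches Unicode word characters such as hiragana by default.

```python
import re

def remove_word_with_kanji(txt):
    # Hypothetical helper: delete 言 together with the single word character after it
    return re.sub(r"言\w", "", txt)

for txt in ["そう言え", "そう言う"]:
    print(remove_word_with_kanji(txt))  # prints そう for both inputs
```

Because the pattern is not anchored with ^ and $, only the two matched characters are removed and the rest of the sentence is kept.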

Related

Need Regex that matches all patterns with format as `{word}{.,#}{word}` with strict matching

So I have been trying to construct a regex that can detect the pattern {word}{.,#}{word} and separate it into [word, ',' (or '.', '#'), word].
But I am not able to create one that matches this pattern strictly and ignores everything else.
I used the following regex
r"[\w]+|[.]"
This one works well, but it doesn't match strictly: if none of the characters (',', '#' or '.') occur in the text, it still gives me words, which I don't want.
I would like a regex that strictly matches the above pattern and gives me the splits (using re.findall), and otherwise returns the whole word as it is.
Please note: the words on either side of {,.#} do not both have to be present, but at least one must be.
Some example text for reference:
no.16 would give me ['no','.','16']
#400 would give me ['#','400']
word1.word2 would give me ['word1','.','word2']
Looking forward to some help and assistance from all regex gurus out there
EDIT:
I forgot to add this. @viktor's version works as needed with only one problem: it ignores ALL other words during re.findall
e.g. ONE TWO THREE #400 with viktor's regex gives me ['','#','400']
but what was expected was ['ONE','TWO','THREE','#','400']
This can be done with NLTK or spaCy, but using those is a limitation for me.
I suggest using
(\w+)?([.,#])((?(1)\w*|\w+))
See the regex demo.
Details
(\w+)? - An optional group #1: one or more word chars
([.,#]) - Group #2: ., , or #
((?(1)\w*|\w+)) - Group #3: if Group 1 matched, match zero or more word chars (the word is optional on the right side then), else, match one or more word chars (there must be a word on the right side of the punctuation chars since there is no word before them).
See the Python demo:
import re
pattern = re.compile(r'(\w+)?([.,#])((?(1)\w*|\w+))')
strings = ['no.16', '#400', 'word1.word2', 'word', '123']
for s in strings:
    print(s, ' -> ', pattern.findall(s))
Output:
no.16 -> [('no', '.', '16')]
#400 -> [('', '#', '400')]
word1.word2 -> [('word1', '.', 'word2')]
word -> []
123 -> []
The answer to your edit is
if re.search(r'\w[.,#]|[.,#]\w', text):
    print(re.findall(r'[.,#]|[^\s.,#]+', text))
If the input string contains a word char followed by any of the three punctuation symbols, or one of those symbols followed by a word char, you can find and extract all occurrences of the [.,#]|[^\s.,#]+ pattern, namely a ., , or #, or one or more chars other than whitespace, ., , and #.
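The two-step check above can be wrapped in a small function. This is only a sketch: the name split_tokens is hypothetical, and returning the input unchanged in a one-element list when no punctuation pattern is found is an assumption based on the question's "returns the whole word as it is".

```python
import re

def split_tokens(text):
    # Hypothetical helper, not part of the original answer.
    # Step 1: only tokenize when a word char touches one of . , #
    if re.search(r'\w[.,#]|[.,#]\w', text):
        # Step 2: emit each punctuation mark and each run of other non-space chars
        return re.findall(r'[.,#]|[^\s.,#]+', text)
    # Assumption: with no such pattern, hand the text back as a single item
    return [text]

print(split_tokens("ONE TWO THREE #400"))  # ['ONE', 'TWO', 'THREE', '#', '400']
print(split_tokens("no.16"))               # ['no', '.', '16']
print(split_tokens("word"))                # ['word']
```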
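I hope this code will solve your problem if you want to split the string by any of the mentioned special characters:
import re

a = 'no.16'
b = '#400'
c = 'word1.word2'
lst = [a, b, c]
for elem in lst:
    # The capturing group keeps the delimiter in the split result
    result = re.split(r'([.,#])', elem)
    # Drop the empty strings that re.split produces at the edges
    result = [part for part in result if part]
    print(result)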
You could do something like this:
import re
text = "no.16"
# [.,#] matches any one of the three delimiters
pattern = re.compile(r"(\w+)([.,#])(\w+)")
result = list(filter(None, pattern.split(text)))
The list(filter(...)) part is needed to remove the empty strings that split returns (see Python - re.split: extra empty strings that the beginning and end list).
However, this will only work if your string only contains these two words separated by one of the delimiters specified by you. If there is additional content before or after the pattern, this will also be returned by split.
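To make that caveat concrete, here is a small sketch comparing a bare token with a token embedded in a longer string. The function name split_and_filter and the second input are made up for illustration, and the character class is written as [.,#] to cover the three delimiters from the question.

```python
import re

pattern = re.compile(r"(\w+)([.,#])(\w+)")

def split_and_filter(text):
    # split() returns the captured groups plus whatever surrounds the match;
    # filter(None, ...) only removes the empty strings
    return list(filter(None, pattern.split(text)))

print(split_and_filter("no.16"))           # ['no', '.', '16']
print(split_and_filter("see no.16 here"))  # ['see ', 'no', '.', '16', ' here']
```

The surrounding text comes back as extra list items, whitespace included, so this approach fits best when each input is a single token.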

Regex to Identify Fixed character alphanumeric word from text in python

I have a text file from which I am trying to remove alphanumeric words of seven characters.
Text1: " I have to remove the following word, **WORD123**, from the text given"
Text2: " I have to remove the following word, **WORD001**, the text given"
So far I tried '\b[A-Za-z0-9]\b' but it doesn't work.
Also, can we add functionality so that it picks only those words which are followed by "from" (not the actual word, just an example)? In the above example it should only pick WORD123, and not WORD001, as the latter is not followed by FROM.
You may use re.sub here, e.g.
import re
inp = "I have to remove the following word, WORD123, FROM the text given"
out = re.sub(r'\s*\b[A-Za-z0-9]{7}\b[^\w]*(?=\bfrom)', '', inp, flags=re.IGNORECASE)
print(out)
This prints:
I have to remove the following word,FROM the text given
Note that the above regex replacement does not match/affect the second sample input sentence you gave, since the 7 letter word is lacking the keyword from as the next word.
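A minimal sketch running the same pattern over both sample sentences from the question shows the difference. The function name remove_code_before_from is hypothetical; the pattern itself is the one from the answer.

```python
import re

# Seven alphanumeric characters, then any non-word characters,
# and a lookahead requiring the keyword "from" right after them
PATTERN = re.compile(r'\s*\b[A-Za-z0-9]{7}\b[^\w]*(?=\bfrom)', flags=re.IGNORECASE)

def remove_code_before_from(text):
    # Hypothetical helper wrapping the substitution
    return PATTERN.sub('', text)

text1 = "I have to remove the following word, WORD123, from the text given"
text2 = "I have to remove the following word, WORD001, the text given"
print(remove_code_before_from(text1))  # WORD123 and the punctuation around it are removed
print(remove_code_before_from(text2))  # unchanged, no "from" follows WORD001
```

Since the lookahead does not consume "from", the keyword stays in the output, but the whitespace and comma between the removed word and "from" are deleted along with it.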

How can I use regex to search unicode texts and find words that contain repeated alphabets?

I have a dataset which contains comments of people in Persian and Arabic. Some comments contain words like عاااالی which is not a real word and the right word is actually عالی. It's like using woooooooow! instead of WoW!.
My intention is to find these words and remove all the extra letters. The only reference I found is the code below, which removes the words with repeated letters:
import re
p = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
s = "df\nAll aaaaaab the best 8965\nUS issssss is 123 good \nqqqq qwerty 1 poiks\nlkjh ggggqwe 1234 aqwe iphone5224s"
strs = s.split("\n")
print([p.sub("", x).strip() for x in strs])
I just need to replace each such word with a version from which the extra repeated letters have been removed. You can use this sentence as a test case:
سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم.
It has to be like this:
سلام چطورین؟ من خیلی گشتم ولی مثل این کیفیت اصلا ندیدم
Please note that more than 3 repeats are not acceptable.
You may use
re.sub(r'([^\W\d_])\1{2,}', r'\1', s)
It will replace runs of three or more identical consecutive letters with a single occurrence.
See the regex demo.
Details
([^\W\d_]) - Capturing group 1: any Unicode letter
\1{2,} - two or more repetitions of the same letter that is captured in Group 1.
The r'\1' replacement will only keep a single letter occurrence in the result.
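Here is a short sketch of that substitution, assuming Python 3 (where str patterns are Unicode-aware by default). The function name squeeze_repeats is made up, and the Persian sample is built from pieces so the number of repeated letters is explicit.

```python
import re

def squeeze_repeats(s):
    # Collapse any run of three or more identical letters into one letter
    return re.sub(r'([^\W\d_])\1{2,}', r'\1', s)

print(squeeze_repeats("woooooooow"))  # wow
print(squeeze_repeats("good 1111"))   # good 1111 (double letters and digits are untouched)

# The word from the question, with the alef repeated four times
stretched = "ع" + "ا" * 4 + "لی"
print(squeeze_repeats(stretched))
```

One limitation to keep in mind: a stretched word whose correct spelling has a double letter loses it, for example "goooood" becomes "god" rather than "good".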

Match a word ending with pattern in a sentence

How can I find words in a sentence that end with a pattern using regex?
I have a list of patterns I want to match within a sentence
For example
my_list = ['one', 'this']
sentence = 'Someone dothis onesome thisis'
Result should return only words that end with items from my_list
['Someone','dothis'] only
since I do not want to match onesome or thisis
You can end your pattern with the word boundary metacharacter \b. It is a zero-width assertion that matches at a position between a word character and a non-word character, or at the start or end of the string next to a word character. So, in that specific case, the pattern would be (one|this)\b.
To actually create a regex from your my_list variable, assuming that no reserved characters are present, you can do:
import re
def words_end_with(sentence, my_list):
    return re.findall(r"({})\b".format("|".join(my_list)), sentence)
If you're using Python 3.6+, you can also use an f-string, to do this formatting inside the string itself:
import re
def words_end_with(sentence, my_list):
    return re.findall(fr"({'|'.join(my_list)})\b", sentence)
See https://www.regular-expressions.info/wordboundaries.html
You can use the following pattern:
\b(\w+(one|this))\b
It says match whole words within word boundaries (\b...\b), and within whole words match any word character (\w+) followed by the literal one or this ((one|this))
https://regex101.com/r/UzhnSw/1/
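Note that re.findall returns the contents of capturing groups when a pattern has any, so (one|this)\b on its own yields the endings rather than the whole words. A possible way to get the full words, sketched below, is a non-capturing group plus \w* in front; the function name words_ending_with is hypothetical, and re.escape is added on the assumption that the endings should be treated as literal text.

```python
import re

def words_ending_with(sentence, endings):
    # re.escape guards against regex metacharacters inside the endings
    alternation = "|".join(re.escape(e) for e in endings)
    # (?:...) is non-capturing, so findall returns each whole match
    return re.findall(r"\b\w*(?:{})\b".format(alternation), sentence)

my_list = ['one', 'this']
sentence = 'Someone dothis onesome thisis'
print(words_ending_with(sentence, my_list))  # ['Someone', 'dothis']
```

Using \w* instead of \w+ means a word that consists of the ending alone, such as a bare "one", would also be returned.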

Python regex keep a few more tokens

I am using the following regex in Python to keep words that do not contain non alphabetical characters:
(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))
The problem is that this regex does not keep words that I would like to keep such as the following:
Company,
months.
third-party
In other words I would like to keep words that are followed by a comma, a dot, or have a dash between two words.
Any ideas on how to implement this?
I tried adding something like |(?<!\S)[A-Za-z]+(?=\.(?!\S)) for the dots but it does not seem to be working.
Thanks !
EDIT:
Should match these:
On-line
. These
maintenance,
other.
. Our
Google
Should NOT match these:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
NY7xtb92dCTfvEjdmkDrUw==
$As_Of_12_31_20104206http://www.sec.gov/CIK0001393311instant2010-12-31T00:00:000001-01-01T00:00:00falsefalseArlington/S.Cooper
-Publisher
gaap_RealEstateAndAccumulatedDepreciationCostsCapitalizedSubsequentToAcquisitionCarryingCostsus
At the moment I am using the following python code to read a text file line by line:
find_words = re.compile(r'(?<!\S)[A-Za-z]+(?!\S)|(?<!\S)[A-Za-z]+(?=:(?!\S))').findall
Then I open the text file:
contents = open("test.txt","r")
and I search for the words line by line:
for line in contents:
    if find_words(line.lower()) != []:
        lineWords = find_words(line.lower())
        print("The words in this line are: ", lineWords)
using some word lists in the following way:
wanted1 = set(find_words(open('word_list_1.csv').read().lower()))
wanted2 = set(find_words(open('word_list_2.csv').read().lower()))
negators = set(find_words(open('word_list_3.csv').read().lower()))
I first want to get the valid words from the .txt file, and then check whether these words belong to the word lists. The two steps are independent.
I propose this regex:
find_words = re.compile(r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)').findall
There are 3 parts from your initial regex that I changed:
Middle part:
[A-Za-z]+(?:-[A-Za-z]+)*
This allows hyphenated words.
The last part:
(?=\W|$)
This is a bit similar to (?!\S) except that it also allows characters that are not spaces, such as punctuation. So this allows a match if, after the matched word, the line ends or there is a non-word character, in other words, a character that is not a letter, digit or underscore (if you want the word in word_ to match, then you will have to change \W to [^a-zA-Z0-9]).
The first part (probably most complex):
(?:(?<=[^\w./-])|(?<=^))
It is itself composed of two alternatives, (?<=[^\w./-]) or (?<=^). The second one allows a match if the word to be matched starts at the beginning of the line. We cannot use (?<=[^\w./-]|^) because python's lookbehind from re cannot be of variable width (with [^\w./-] having a length of 1 and ^ a length of 0).
(?<=[^\w./-]) allows a match if the character before the word is not a word character, period, forward slash or hyphen.
When broken down, the small parts are rather straightforward I think, but if there's anything you want some more elaboration, I can give more details.
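To see the three parts working together, here is a small sketch that runs the proposed regex over a few strings taken from the question. The samples list is only an illustration, not a full test of every case in the question.

```python
import re

find_words = re.compile(
    r'(?:(?<=[^\w./-])|(?<=^))[A-Za-z]+(?:-[A-Za-z]+)*(?=\W|$)'
).findall

samples = [
    "Company, months. third-party",  # trailing punctuation and a hyphenated word
    "-Publisher",                    # leading hyphen blocks the lookbehind
    "NY7xtb92dCTfvEjdmkDrUw==",      # letters mixed with digits
]
for s in samples:
    print(repr(s), '->', find_words(s))
```

The first sample yields the three words without their punctuation, while the other two yield empty lists because every run of letters is either preceded by a disallowed character or followed by a digit.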
This is not a regex task, because you cannot detect real words with a regex. You must have a dictionary to check your words against.
So I suggest using a regex to split your string on non-alphabetical characters and checking whether all of the items exist in your dictionary. For example:
import re
# my_string and my_dict are placeholders for your text and your word collection
words = re.split(r'[^A-Za-z]+', my_string)
print(all(i in my_dict for i in words if i))
As an alternative you can use nltk.corpus as your dictionary:
from nltk.corpus import wordnet
words = re.split(r'[^A-Za-z]+', my_string)
if all(wordnet.synsets(i) for i in words if i):
    pass  # do stuff
But if you want to use your own word list, you need to change your regex because it is incorrect; instead use re.split as above:
all_words = wanted1 | wanted2 | negators
with open("test.txt", "r") as f:
    for line in f:
        for word in line.split():
            words = re.split(r'[^A-Za-z]+', word)
            if all(i in all_words for i in words if i):
                print(word)
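The dictionary idea can be sketched in a self-contained form as follows. Everything named here is an assumption for illustration: the tiny vocabulary set stands in for the combined word lists, the helper is_known_token is hypothetical, and the split is done on non-alphabetical characters as the answer describes.

```python
import re

# Hypothetical vocabulary standing in for wanted1 | wanted2 | negators
vocabulary = {"third", "party", "maintenance", "google"}

def is_known_token(token, vocabulary):
    # Split on anything that is not a letter and drop empty pieces
    parts = [p for p in re.split(r'[^A-Za-z]+', token.lower()) if p]
    # Require at least one alphabetic part so tokens like "==" are rejected
    return bool(parts) and all(p in vocabulary for p in parts)

for token in ["third-party", "maintenance,", "Google", "NY7xtb92dCTfvEjdmkDrUw==", "=="]:
    print(token, is_known_token(token, vocabulary))
```

The explicit bool(parts) check matters because all() over an empty sequence is True, which would otherwise accept tokens that contain no letters at all.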
Instead of using all sorts of complicated look-arounds, you can use \b to detect the boundary of words. This way, you can use e.g. \b[a-zA-Z]+(?:-[a-zA-Z]+)*\b
Example:
>>> p = r"\b[a-zA-Z]+(?:-[a-zA-Z]+)*\b"
>>> text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
>>> re.findall(p, text)
['This', 'is', 'some', 'example', 'text', 'with', 'some', 'multi-hyphen-words', 'and', 'words', 'in', 'it']
Update: Seems like this does not work too well, as it also detects fragments from URLs, e.g. www, sec and gov from http://www.sec.gov.
Instead, you might try this variant, using look-around explicitly stating the 'legal' characters:
r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])"""
This seems to pass all your test-cases.
Let's dissect this regex:
(?<![^\s("]) - look-behind asserting that the word is preceded by space, quote or parens, but e.g. not a number (using double-negation instead of positive look-behind so the first word is matched, too)
[a-zA-Z]+ - the first part of the word
(?:[-'][a-zA-Z]+)* - optionally more word-segments after a ' or -
(?=[\s.,:;!?")]) - look-ahead asserting that the word is followed by space, punctuation, quote or parens
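The dissected regex can be tried out with a short sketch like the one below. The first input is the example sentence used earlier in this answer; the two "Our Google" inputs are made up to show one consequence of the look-ahead, which needs an actual character after the word.

```python
import re

p = re.compile(r"""(?<![^\s("])[a-zA-Z]+(?:[-'][a-zA-Z]+)*(?=[\s.,:;!?")])""")

text = "This is some example text, with some multi-hyphen-words and invalid42 words in it."
print(p.findall(text))

# A word at the very end of the string has nothing after it for the look-ahead
print(p.findall("Our Google"))    # ['Our']
print(p.findall("Our Google\n"))  # ['Our', 'Google']
```

When lines are read from a file they normally end with a newline, so the last word of each line is still matched; for strings without a trailing character, one option would be to add |$ inside the look-ahead, which is a suggestion and not part of the original answer.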
