how to prevent regex matching substring of words?


I have a regex in Python and I want to prevent it from matching substrings. I want to add '#' at the beginning of words that consist of 4 to 15 alphanumeric or underscore characters, but the regex also matches substrings of larger words. I have this function:
import re

def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'([a-zA-Z0-9_]{4,15})', r'#\1', str(sent))
    return sents
And the example is:
mylist = list()
mylist.append("ali_s ali_t ali_u aabs:/t.co/kMMALke2l9")
add_atsign(mylist)
And the result is:
['#ali_s #ali_t #ali_u #aabs:/t.co/#kMMALke2l9']
As you can see, it puts '#' at the beginning of 'aabs' and 'kMMALke2l9', which is wrong.
I tried to edit the code as below:
def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'((^|\s)[a-zA-Z0-9_]{4,15}(\s|$))', r'#\1', str(sent))
    return sents
But the result becomes:
['#ali_s ali_t# ali_u aabs:/t.co/kMMALke2l9']
As you can see, the replacements are wrong.
The result I expect is:
"#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9"
Could anyone help?
Thanks

This is a pretty interesting question. If I understand correctly, the issue is that you want to divide the string by spaces, and then do the replacement only if the entire word matches, and not catch a substring.
I think the best way to do this is to first split by spaces, and then add assertions to your regex that catch only an entire string:
def add_atsign(sents):
    new_list = []
    for string in sents:
        new_list.append(' '.join(re.sub(r'^([a-zA-Z0-9_]{4,15})$', r'#\1', w)
                                 for w in string.split()))
    return new_list

mylist = ["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]
add_atsign(mylist)
Output:
['#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9']
That is, we split, then replace only if the entire word matches, then rejoin.
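By the way, your regex can be shortened to r'^(\w{4,15})$'. Note that for str patterns in Python 3, \w also matches non-ASCII letters and digits; pass flags=re.ASCII to re.sub if you want exactly [a-zA-Z0-9_]:
def add_atsign(sents):
    new_list = []
    for string in sents:
        new_list.append(' '.join(re.sub(r'^(\w{4,15})$', r'#\1', w)
                                 for w in string.split()))
    return new_list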

You can restrict matches to whole space-separated words by adding (?<=\s) to the start and \s to the end of your first expression. Because the pattern requires a trailing \s, a qualifying word at the very end of the string is not matched.
def add_atsign(sents):
    for i, sent in enumerate(sents):
        sents[i] = re.sub(r'((^|(?<=\s))[a-zA-Z0-9_]{4,15}\s)', r'#\1', str(sent))
    return sents
The result will be like this:
['#ali_s #ali_t #ali_u aabs:/t.co/kMMALke2l9']
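A variant of the same idea that also covers a word at the very end of the string is to use lookarounds on both sides, so that no whitespace is consumed at all. This is only a sketch under that assumption; the names add_prefix and WORD and the prefix parameter are illustrative and do not come from the question.

```python
import re

# (?<!\S) : not preceded by a non-space character (start of string or whitespace)
# (?!\S)  : not followed by a non-space character (end of string or whitespace)
WORD = re.compile(r'(?<!\S)([a-zA-Z0-9_]{4,15})(?!\S)')

def add_prefix(sents, prefix='#'):
    """Return a new list with prefix added to each standalone 4-15 character word."""
    return [WORD.sub(lambda m: prefix + m.group(1), str(sent)) for sent in sents]

print(add_prefix(["ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"]))
```

Unlike the split-and-join approach, this leaves the original spacing between words untouched.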

I am not sure what you are trying to accomplish, but the reason the # ends up in the wrong places is that once you add \s or ^ to the regex, the whitespace becomes part of the match, so the # is inserted before the whitespace. The trailing (\s|$) also consumes the space that the next word would need for its own leading \s, which is why ali_t is skipped.
You could try to split it into two steps:
check at the beginning of the string and put the # at the first position, and
check after every whitespace and put the # at the second position.
I'm aware it's not optimal, but maybe I can help more if you clarify in a bit more detail what the regex is supposed to match and what it shouldn't.
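One way to act on that observation in a single pass is to capture the leading whitespace in its own group and write it back before the #, while checking the trailing boundary with a lookahead so it is not consumed. A minimal sketch, assuming words are separated by whitespace; add_hash is a hypothetical helper name:

```python
import re

def add_hash(sent):
    # Group 1 keeps the leading whitespace (or the empty match at ^) so it can be
    # written back before the '#'. The lookahead (?=\s|$) checks the trailing
    # boundary without consuming it, leaving that space available to the next word.
    return re.sub(r'(^|\s)([a-zA-Z0-9_]{4,15})(?=\s|$)', r'\1#\2', sent)

print(add_hash("ali_s ali_t ali_u aabs:/t.co/kMMALke2l9"))
```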

Related

Why does the order of expressions matter in re.match?

I'm making a function that takes a string like "three()" or "{1 + 2}" and splits it into a list of tokens (e.g. "three()" becomes ["three", "(", ")"]). I am using re.match to help separate the string.
def lex(s):
    # scan input string and return a list of its tokens
    seq = []
    patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/|)(\t|\n|\r| )*")
    m = re.match(patterns,s)
    while m != None:
        if s == '':
            break
        seq.append(m.group(2))
        s = s[len(m.group(0)):]
        m = re.match(patterns,s)
    return seq
This works if the string is just "three". But if the string contains "()" or any other symbol, it never leaves the while loop.
Oddly, when I move ([a-z])* to the end of the alternation in the pattern, it works. Why is that happening?
Works: patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
Does not work: patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")

This one is a bit tricky, but the problem is with this part: ([a-z])*. It matches any run of lowercase letters of length zero or more, including the empty string.
If you put this sequence at the end, like here:
patterns = (r"^(\t|\n|\r| )*([0-9]|\(|\)|\*|\/|([a-z])*)(\t|\n|\r| )*")
The regex engine will try the other matches first, and if it finds a match, stop there. Only if none of the others match, does it try ([a-z])* and since * is 'greedy', it will match all of three, then proceed to match ( and finally ).
Read an explanation of how the full expression is tested in the documentation (thanks to #kaya3).
However, if you put that sequence at the start, like here:
patterns = (r"^(\t|\n|\r| )*(([a-z])*|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
It will now try to match it first. It's still greedy, so three still gets matched. But on the next iteration, it tries to match ([a-z])* against the remaining '()', and it succeeds, since that string starts with zero letters.
Because that match is empty, len(m.group(0)) is 0, so s never gets shorter and the loop never ends. You can fix it by changing the * to a +, which only matches if there are one or more letters:
patterns = (r"^(\t|\n|\r| )*(([a-z])+|[0-9]|\(|\)|\*|\/)(\t|\n|\r| )*")
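To make the loop robust regardless of alternation order, it can help to advance by position and to treat a failed match as an error instead of looping. The following is only a sketch of that idea, not the asker's code: the compiled TOKEN pattern and the ValueError are assumptions, and the token set is limited to the symbols that appear in the original pattern.

```python
import re

# One token per match: a run of lowercase letters (+ means at least one),
# a single digit, or one of the symbols ( ) * /. Surrounding whitespace is skipped.
TOKEN = re.compile(r"\s*([a-z]+|[0-9]|[()*/])\s*")

def lex(s):
    """Scan the input string and return a list of its tokens."""
    tokens = []
    pos = 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if m is None:
            # Nothing matched here: fail loudly instead of looping forever.
            raise ValueError(f"unexpected character {s[pos]!r} at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()  # always moves forward, since every token has at least one character
    return tokens

print(lex("three()"))
```

Because every alternative consumes at least one character, the order of the alternatives no longer affects termination.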

Find the word from the list given and replace the words so found

My question is pretty simple, but I haven't been able to find a proper solution.
Given below is my program:
given_list = ["Terms","I","want","to","remove","from","input_string"]
input_string = input("Enter String:")
if any(x in input_string for x in given_list):
    #Find the detected word
    #Not in bool format
    a = input_string.replace(detected_word,"")
    print("Some Task",a)
Here, given_list contains the terms I want to exclude from the input_string.
Now, the problem I am facing is that any() produces a bool result, but I need the word that any() detected so that I can replace it with a blank and then perform some task.
Edit: any() function is not required at all, look for useful solutions below.

Iterate over given_list and replace them:
for i in given_list:
    input_string = input_string.replace(i, "")
print("Some Task", input_string)

No need to detect at all:
for w in given_list:
    input_string = input_string.replace(w, "")
str.replace does nothing if the word is not there, and the substring test needed for detection would have to scan the string anyway.

The problem with finding each word and replacing it is that Python has to iterate over the whole string repeatedly. Another problem is that you will match substrings where you don't want to. For example, "to" is in the exclude list, so you'd end up changing "tomato" to "ma".
It seems to me that you want to replace whole words. Parsing is a whole subject of its own, but let's simplify. I'm going to assume everything is lowercase with no punctuation, although that can be improved later. Let's use input_string.split() to iterate over whole words.
We want to replace some words with nothing, so let's iterate over the words of input_string and filter out the ones we don't want, using the builtin function filter.
exclude_list = ["terms","i","want","to","remove","from","input_string"]
input_string = "one terms two i three want to remove"
keepers = filter(lambda w: w not in exclude_list, input_string.lower().split())
output_string = ' '.join(keepers)
print(output_string)
Output:
one two three
Note that filter returns an iterator, which lets us go through the whole input string just once. Instead of replacing words, we simply skip the ones we don't want by having the iterator not return them.
Since filter requires a function for the boolean check on whether to keep each word, we had to define one. I used lambda syntax to do that. You could replace it with a named function:
def keep(word):
    return word not in exclude_list

keepers = filter(keep, input_string.lower().split())
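If the original casing and punctuation need to be preserved, one alternative (a swapped-in technique, not what the answer above uses) is a single regex with \b word boundaries. A minimal sketch; remove_words is a hypothetical helper name, and it assumes each entry in the list starts and ends with a word character so that \b behaves as expected:

```python
import re

def remove_words(text, words):
    """Remove whole-word occurrences of each entry in words from text."""
    # re.escape guards against entries containing regex metacharacters;
    # \b on both sides keeps "to" from matching inside "tomato".
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    cleaned = pattern.sub("", text)
    # Collapse the gaps left behind by the removed words.
    return " ".join(cleaned.split())

print(remove_words("I want to eat a tomato", ["I", "want", "to"]))
```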

To answer your question about any, use an assignment expression (Python 3.8+).
if any((word := x) in input_string for x in given_list):
    # match captured in variable word
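On Python versions before 3.8, or simply as a matter of taste, the first matching word can also be fetched with next() and a default value. This is a sketch of that alternative; first_match is a hypothetical helper name, and like the original test it uses substring matching.

```python
def first_match(text, words):
    """Return the first entry of words that occurs in text, or None if none does."""
    return next((w for w in words if w in text), None)

given_list = ["Terms", "I", "want", "to", "remove", "from", "input_string"]
input_string = "I would like to stay"

detected_word = first_match(input_string, given_list)
if detected_word is not None:
    a = input_string.replace(detected_word, "")
    print("Some Task", a)
```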

How would I remove the Arabic prefix "ال" from an arabic string?

I have tried things like this, but there is no change between the input and output:
def remove_al(text):
    if text.startswith('ال'):
        text.replace('ال','')
    return text

text.replace returns the updated string but doesn't change the original, so you should change the code to
text = text.replace(...)
Note that in Python strings are "immutable"; there's no way to change even a single character of a string; you can only create a new string with the value you want.

If you only want to remove the prefix ال and not every occurrence of ال in the string, I'd suggest using:
def remove_prefix_al(text):
    if text.startswith('ال'):
        return text[2:]
    return text
If you simply use text.replace('ال',''), this will replace all ال combinations:
Example
text = 'الاستقلال'
text.replace('ال','')
Output:
'استقل'
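On Python 3.9 and later, str.removeprefix does the same job as the slicing version. The sketch below falls back to slicing on older versions; the prefix parameter is an addition for illustration and is not part of the answer above.

```python
def remove_prefix_al(text, prefix='ال'):
    """Remove prefix from the start of text, leaving later occurrences alone."""
    if hasattr(text, 'removeprefix'):  # available on Python 3.9+
        return text.removeprefix(prefix)
    # Fallback for older versions: slice off the prefix if present.
    return text[len(prefix):] if text.startswith(prefix) else text

print(remove_prefix_al('الاستقلال'))
```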

str.lstrip may look like a ready-made alternative to rolling your own, but be careful: its argument is a set of characters to strip, not a prefix. It works for the example below only because the letter after ال is neither ا nor ل.
Example text (alrashid) in Arabic: 'الرَشِيد'
text = 'الرَشِيد'
clean_text = text.lstrip('ال')
print(clean_text)
Note that even though Arabic reads from right to left, lstrip strips the start of the string (which is visually on the right).
Also, as user 6502 noted, the issue in your code is that Python strings are immutable: replace returned a new string that was discarded, so the function returned its input unchanged.
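The difference between stripping a character set and removing a prefix shows up as soon as the stem itself begins with ل. A small illustration, using the word اللغة ("the language") as an example chosen here, not taken from the answers:

```python
word = 'اللغة'

# lstrip removes every leading character that is in the set {ا, ل},
# so it also eats the first letter of the stem.
stripped = word.lstrip('ال')

# Slicing after a startswith check removes exactly the two-letter prefix.
sliced = word[2:] if word.startswith('ال') else word

print(stripped, sliced)
```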

"ال" as a prefix is complex enough in Arabic that you will need a regex to accurately separate it from its stem and from other prefixes. The following code will help you isolate "ال" in most words:
import re
text = 'والشعر كالليل أسود'
words = text.split()
for word in words:
    alx = re.search(r'''^
        ([وف])?
        ([بك])?
        (لل)?
        (ال)?
        (.*)$''', word, re.X)
    groups = [alx.group(1), alx.group(2), alx.group(3), alx.group(4), alx.group(5)]
    groups = [x for x in groups if x]
    print(word, groups)
Running that (in Jupyter) you will get:
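To inspect the pieces for a single word, the same pattern can be wrapped in a small helper. This is a sketch built on the answer's regex; split_prefixes and PREFIXES are hypothetical names introduced here.

```python
import re

# Same pattern as above: optional conjunction, optional preposition,
# optional لل, optional ال, then the rest of the word as the stem.
PREFIXES = re.compile(r'''^
    ([وف])?
    ([بك])?
    (لل)?
    (ال)?
    (.*)$''', re.X)

def split_prefixes(word):
    """Return the non-empty parts of word: any matched prefixes, then the stem."""
    return [part for part in PREFIXES.search(word).groups() if part]

for word in 'والشعر كالليل أسود'.split():
    print(word, split_prefixes(word))
```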

Using strip() to remove only one element

I have a word within two opening and two closing parentheses, like this: ((word)).
I want to remove the first and the last parenthesis, so they are not duplicated, in order to obtain something like this: (word).
I have tried using strip('()') on the variable that contains ((word)). However, it removes ALL parentheses at the beginning and at the end. Result: word.
Is there a way to specify that I only want the first and last one removed?

For this you could slice the string and only keep from the second character until the second to last character:
word = '((word))'
new_word = word[1:-1]
print(new_word)
Produces:
(word)
For varying numbers of parentheses, you could first count how many there are and pass that to the slice. This leaves exactly one bracket on each side; if you only want to remove the first and last bracket, use the first suggestion:
word = '((((word))))'
quan = word.count('(')
new_word = word[quan-1:1-quan]
print(new_word)
Produces:
(word)
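Plain slicing drops the first and last characters whatever they are, so a guard can be useful when some inputs are not doubly wrapped. A minimal sketch; strip_one_pair is a hypothetical helper name:

```python
def strip_one_pair(text):
    """Remove one outer pair of parentheses, but only if the text is doubly wrapped."""
    if text.startswith('((') and text.endswith('))'):
        return text[1:-1]
    return text

print(strip_one_pair('((word))'))
```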

You can use a regex. Use a raw string so that \( and \w are not treated as invalid string escapes:
import re
word = '((word))'
re.findall(r'(\(?\w+\)?)', word)[0]
This keeps only one pair of brackets.

Instead, use str.replace, for example your_string.replace('(', '', 1).
By default replace substitutes every '(' with '', but the optional third argument limits it to the first count occurrences, so only the first '(' is replaced. Since count always works from the left, the last ')' needs separate handling.
From the documentation:
replace(...)
S.replace (old, new[, count]) -> string
Return a copy of string S with all occurrences of substring
old replaced by new. If the optional argument count is
given, only the first count occurrences are replaced.

You can replace the double opening and double closing parentheses, setting the count parameter to 1 for both operations:
print('((word))'.replace('((', '(', 1).replace('))', ')', 1))
But this will not give the intended result if double closing parentheses also occur earlier in the string, because replace works from the left.
Reversing the string before replacing the closing ones may help:
t = '((word))'
t = t.replace('((', '(', 1)
t = t[::-1]  # reverse, see https://stackoverflow.com/questions/931092/reverse-a-string-in-python
t = t.replace('))', ')', 1)
t = t[::-1]  # and reverse again

Well, I used a regular expression for this and replaced each run of brackets with a single one using re.sub:
import re
s = "((((((word)))))))))"
t = re.sub(r"\(+", "(", s)
g = re.sub(r"\)+", ")", t)
print(g)
Output:
(word)

Try below:
>>> import re
>>> w = '((word))'
>>> re.sub(r'([()])\1+', r'\1', w)
'(word)'
>>> w = 'Hello My ((word)) into this world'
>>> re.sub(r'([()])\1+', r'\1', w)
'Hello My (word) into this world'

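Try this one (the slice has to be assigned back, and the variable is named word to avoid shadowing the builtin str):
word = "((word))"
word = word[1:len(word)-1]
print(word)
The output is: (word)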

How can lines with other characters than letters be removed from output?

I have code that extracts bigrams from a large corpus and concatenates them to get unigrams, e.g. 'may', 'be' --> maybe. The corpus contains, of course, a lot of punctuation, but I also discovered that it contains other characters such as emojis. My plan was to put punctuation characters in a list and print a line only if none of them appear in it. Maybe I should change my approach and print only the lines that consist entirely of letters, since I don't know what kinds of characters are in the corpus. How can this be done? I do need to keep these other characters for the first part of the code, so that bigrams that don't actually exist are printed. The last lines of my code are at the moment:
counted = collections.Counter(grams)
for gram, count in sorted(counted.items()):
    s = ''
    print(s.join(gram))
And the output I get is:
!aku
!bet
!brå
!båda
These lines won't be of any use for me... Would really appreciate some help! :)

If you want to check that each string contains only letters you can probably use the isalpha() method.
>>> '!båda'.isalpha()
False
>>> 'båda'.isalpha()
True
As you can see from the example, this method recognizes any Unicode letter, not just ASCII.
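Applied to the loop from the question, the check could look roughly like this. The sample grams list and the function name letter_only_unigrams are made up for illustration; the real grams come from the corpus.

```python
import collections

def letter_only_unigrams(grams):
    """Join each distinct bigram and keep only results made up entirely of letters."""
    counted = collections.Counter(grams)
    unigrams = (''.join(gram) for gram, count in sorted(counted.items()))
    return [u for u in unigrams if u.isalpha()]

# Hypothetical sample data standing in for the corpus bigrams.
grams = [('may', 'be'), ('!', 'aku'), ('bå', 'da'), ('may', 'be')]
for unigram in letter_only_unigrams(grams):
    print(unigram)
```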

To filter out strings that contain a non-letter character, the code can check each string for characters outside a set of valid letters:
# coding=utf-8
import string
import unicodedata
source_strings = [u'aku', u'bet', u'brå', u'båda', u'!båda']
valid_chars = set(string.ascii_letters)
# NFKD splits 'å' into 'a' plus a combining ring; encoding to ASCII drops the ring.
# decode() is needed on Python 3, where encode() returns bytes.
valid_strings = [s for s in source_strings if
                 set(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')) <= valid_chars]
# valid_strings == [u'aku', u'bet', u'brå', u'båda']
# "!båda" was not included.

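You can use the unicodedata module to classify the characters. Category 'Ll' means lowercase letter, so strings containing uppercase letters are rejected as well:
import unicodedata
unigram = ''.join(gram)
if all(unicodedata.category(char) == 'Ll' for char in unigram):
    print(unigram)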

If you only want to remove certain characters from your lines, you can filter them out with a simple replace before further processing:
sourceList = ['!aku', '!bet', '!brå', '!båda']
newList = []
for word in sourceList:
    for special in ['!','&','å']:
        word = word.replace(special,'')
    newList.append(word)
Then you can do whatever is needed for your bigram exercise. Hope this helps.
Second approach: if there are many different characters to exclude, you can always use isalpha() on each string:
sourceList = ['!aku', '!bet', 'nor mal alpha', '!brå', '!båda']
newList = [word for word in sourceList if word.isalpha()]
In this case only strings consisting entirely of letters are kept. Hope this clarifies the second approach.
