I have a text as follows.
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
I want to convert it to lowercase, except the words that has _ABB in it.
So, my output should look as follows.
mytext = "this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday"
My current code is as follows.
splits = mytext.split()
newtext = []
for item in splits:
if not '_ABB' in item:
item = item.lower()
newtext.append(item)
else:
newtext.append(item)
However, I want to know if there is any easy way of doing this, possibly in one line?
You can use a one liner splitting the string into words, check the words with str.endswith() and then join the words back together:
' '.join(w if w.endswith('_ABB') else w.lower() for w in mytext.split())
# 'this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday'
Of course use the in operator rather than str.endswith() if '_ABB' can actually occur anywhere in the word and not just at the end.
Extended regex approach:
import re
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
result = re.sub(r'\b((?!_ABB)\S)+\b', lambda m: m.group().lower(), mytext)
print(result)
The output:
this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday
Details:
\b - word boundary
(?!_ABB) - lookahead negative assertion, ensures that the given pattern will not match
\S - non-whitespace character
\b((?!_ABB)\S)+\b - the whole pattern matches a word NOT containing substring _ABB
Here is another possible(not elegant) one-liner:
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
print(' '.join(map(lambda x : x if '_ABB' in x else x.lower(), mytext.split())))
Which Outputs:
this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday
Note: This assumes that your text will only seperate the words by spaces, so split() suffices here. If your text includes punctuation such as",!.", you will need to use regex instead to split up the words.
Related
I am still new to regular expressions, as in the Python library re.
I want to extract all the proper nouns as a whole word if they are separated by space.
I tried
result = re.findall(r'(\w+)\w*/NNP (\w+)\w*/NNP', tagged_sent_str)
Input: I have a string like
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
Output expected:
[('European Community'), ('European')]
Current output:
[('European','Community')]
But this will only give the pairs not the single ones. I want all the kinds
IIUC, itertools.groupby is more suited for this kind of job:
from itertools import groupby
def join_token(string_, type_ = 'NNP'):
res = []
for k, g in groupby([i.split('/') for i in string_.split()], key=lambda x:x[1]):
if k == type_:
res.append(' '.join(i[0] for i in g))
return res
join_token(tagged_sent_str)
Output:
['European Community', 'European']
and it doesn't require a modification if you expect three or more consecutive types:
str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
join_token(str2)
Output:
['European Community Union', 'European']
Interesting requirement. Code is explained in the comments, a very fast solution using only REGEX:
import re
# make it more complex
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"
# 1: First clean app target words word/NNP to word,
# you can use str.replace but just to show you a technique
# how to to use back reference of the group use \index_of_group
# re.sub(r'/NNP', '', text)
# text.replace('/NNP', '')
_text = re.sub(r'(\w+)/NNP', r'\1', text)
# this pattern strips the leading and trailing spaces
RE_FIND_ALL = r'(?:\s+|^)((?:(?:\s|^)?\w+(?=\s+|$)?)+)(?:\s+|$)'
print('RESULT : ', re.findall(RE_FIND_ALL, _text))
OUTPUT:
RESULT : ['European0', 'European1 Community1 Community2', 'European2', 'European2']
Explaining REGEX:
(?:\s+|^) : skip leading spaces
((?:(?:\s)?\w+(?=\s+|$))+): capture a group of non copture subgroup (?:(?:\s)?\w+(?=\s+|$)) subgroup will match all sequence words folowed by spaces or end of line. and that match will be captured by the global group. if we don't do this the match will return only the first word.
(?:\s+|$) : remove trailing space of the sequence
I needed to remove /NNP from the target words because you want to keep the sequence of word/NNP in a single group, doing something like this (word)/NNP (word)/NPP this will return two elements in one group but not as a single text, so by removing it the text will be word word so REGEX ((?:\w+\s)+) will capture the sequence of word but it's not a simple as this because we need to capture the word that doesn't contain /sequence_of_letter at the end, no need to loop over the matched groups to concatenate element to build a valid text.
NOTE: both solutions work fine if all words are in this format word/sequence_of_letters; if you have words that are not in this format
you need to fix those. If you want to keep them add /NPP at the end of each word, else add /DUMMY to remove them.
Using re.split but slow because I'm using list comprehensive to fix result:
import re
# make it more complex
text = "export1/VB Europian0/NNP export/VB Europian1/NNP Community1/NNP Community2/NNP French/JJ Europian2/NNP export/VB Europian2/NNP export/VB export/VB"
RE_SPLIT = r'\w+/[^N]\w+'
result = [x.replace('/NNP', '').strip() for x in re.split(RE_SPLIT, text) if x.strip()]
print('RESULT: ', result)
You'd like to get a pattern but with some parts deleted from it.
You can get it with two successive regexes:
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
[ re.sub(r"/NNP","",s) for s in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*",tagged_sent_str) ]
['European Community', 'European']
I have a lot of data which contain the word "not".
For example : "not good".
I want to convert "not good" to "notgood" (without a space).
How can I convert all of the "not"s in the data, erasing the space after "not".
For example in the below list:
1. I am not beautiful → I am notbeautiful
2. She is not good to be a teacher → She is notgood to be a teacher
3. If I choose A, I think it's not bad decision → If I choose A, I think it's notbad decision
A simple way to do this would be to replace not_ with not, removing the space.
text = "I am not beautiful"
new_text = text.replace("not ", "not")
print(new_text)
Will output:
I am notbeautiful
I suggest that you use regular expression to match the word with boundary, in order to avoid matching phrases like "tying the knot with someone":
import re
output = re.replace(r'(?<=\bnot)\s+(?=\w+)', '', text)
OR:
s = "I am not beautiful"
news=''.join(i+' ' if i != 'not' else i for i in s.split())
print(news)
Output:
I am notbeautiful
If you care about the space at the end do:
print(news.rstrip())
I'm doing research on sentiment analysis. In a list of data, I'd like to remove all punctuation, in orde to get to the words in their pure version. But I would like to keep emoticons, such as :) and :/.
Is there a way to say in Python that I want to remove all punctuation signs unless they appear in a combination such as ":)", ":/", "<3"?
Thanks in advance
This is my code for the stripping:
for message in messages:
message=message.lower()
message=message.replace("!","")
message=message.replace(".","")
message=message.replace(",","")
message=message.replace(";","")
message=message.replace(";","")
message=message.replace("?","")
message=message.replace("/","")
message=message.replace("#","")
You can try this regex:
(?<=\w)[^\s\w](?![^\s\w])
Usage:
import re
print(re.sub(r'(?<=\w)[^\s\w](?![^\s\w])', '', your_data))
Here is an online demo.
The idea is to match a single special character if it is preceded by a letter.
If the regex doesn't work as you expect, you can customize it a little. For example if you don't want it to match commas, you can remove them from the character class like so: (?<=\w)[^\s\w,](?![^\s\w]). Or if you want to remove the emoticon :-), you can add it to the regex like so: (?<=\w)[^\s\w](?![^\s\w])|:-\).
Going off of the work you've already done using str.replace, you could do something like this:
lines = [
"Sentence 1.",
"Sentence 2 :)",
"Sentence <3 ?"
]
emoticons = {
":)": "000smile",
"<3": "000heart"
}
emoticons_inverse = {v: k for k, v in emoticons.items()}
punctuation = ",./<>?;':\"[]\\{}|`~!##$%^&*()_+-="
lines_clean = []
for line in lines:
#Replace emoticons with non-punctuation
for emote, rpl in emoticons.items():
line = line.replace(emote, rpl)
#Remove punctuation
for char in line:
if char in punctuation:
line = line.replace(char, "")
#Revert emoticons
for emote, rpl in emoticons_inverse.items():
line = line.replace(emote, rpl)
lines_clean.append(line)
print(lines_clean)
This is not super efficient, though, so if performance becomes a bottleneck you might want to examine how you can make this faster.
Output: python3 test.py
['Sentence 1', 'Sentence 2 :)', 'Sentence <3 ']
Your best bet might be to simply declare a list of emoticons as a variable. Then compare your punctuation to the list. If it's not in the list, remove it from the string.
Edit: Instead of using a whole block of str.replace() over and over, you might try something like:
to_remove = ".,;:!()\"
for char in to_remove:
message = message.replace(char, "")
Edit 2:
The simplest way (skill-wise) might be to try this:
from string import punctuation
emoticons = [":)" ":D" ":("]
word_list = message.split(" ")
for word in word_list:
if word not in emoticons:
word = word.translate(None, punctuation)
output = " ".join(word_list)
Once again, this will only work on emoticons that are separated from other characters, i.e. "Sure :D" but not "Sorry:(".
I want to replace(re-spell) a word A in a text string with another word B if the word A occurs before an operator. Word A can be any word.
E.G:
Hi I am Not == you
Since "Not" occurs before operator "==", I want to replace it with alist["Not"]
So, above sentence should changed to
Hi I am alist["Not"] == you
Another example
My height > your height
should become
My alist["height"] > your height
Edit:
On #Paul's suggestion, I am putting the code which I wrote myself.
It works but its too bulky and I am not happy with it.
operators = ["==", ">", "<", "!="]
text_list = text.split(" ")
for index in range(len(text_list)):
if text_list[index] in operators:
prev = text_list[index - 1]
if "." in prev:
tokens = prev.split(".")
prev = "alist"
for token in tokens:
prev = "%s[\"%s\"]" % (prev, token)
else:
prev = "alist[\"%s\"]" % prev
text_list[index - 1] = prev
text = " ".join(text_list)
This can be done using regular expressions
import re
...
def replacement(match):
return "alist[\"{}\"]".format(match.group(0))
...
re.sub(r"[^ ]+(?= +==)", replacement, s)
If the space between the word and the "==" in your case is not needed, the last line becomes:
re.sub(r"[^ ]+(?= *==)", replacement, s)
I'd highly recommend you to look into regular expressions, and the python implementation of them, as they are really useful.
Explanation for my solution:
re.sub(pattern, replacement, s) replaces occurences of patterns, that are given as regular expressions, with a given string or the output of a function.
I use the output of a function, that puts the whole matched object into the 'alist["..."]' construct. (match.group(0) returns the whole match)
[^ ] match anything but space.
+ match the last subpattern as often as possible, but at least once.
* match the last subpattern as often as possible, but it is optional.
(?=...) is a lookahead. It checks if the stuff after the current cursor position matches the pattern inside the parentheses, but doesn't include them in the final match (at least not in .group(0), if you have groups inside a lookahead, those are retrievable by .group(index)).
str = "Hi I am Not == you"
s = str.split()
y = ''
str2 = ''
for x in s:
if x in "==":
str2 = str.replace(y, 'alist["'+y+'"]')
break
y = x
print(str2)
You could try using the regular expression library I was able to create a simple solution to your problem as shown here.
import re
data = "Hi I am Not == You"
x = re.search(r'(\w+) ==', data)
print(x.groups())
In this code, re.search looks for the pattern of (1 or more) alphanumeric characters followed by operator (" ==") and stores the result ("Hi I am Not ==") in variable x.
Then for swaping you could use the re.sub() method which CodenameLambda suggested.
I'd also recommend learning how to use regular expressions, as they are useful for solving many different problems and are similar between different programming languages
I am building a forum application in Django and I want to make sure that users dont enter certain characters in their forum posts. I need an efficient way to scan their whole post to check for the invalid characters. What I have so far is the following although it does not work correctly and I do not think the idea is very efficient.
def clean_topic_message(self):
topic_message = self.cleaned_data['topic_message']
words = topic_message.split()
if (topic_message == ""):
raise forms.ValidationError(_(u'Please provide a message for your topic'))
***for word in words:
if (re.match(r'[^<>/\{}[]~`]$',topic_message)):
raise forms.ValidationError(_(u'Topic message cannot contain the following: <>/\{}[]~`'))***
return topic_message
Thanks for any help.
For a regex solution, there are two ways to go here:
Find one invalid char anywhere in the string.
Validate every char in the string.
Here is a script that implements both:
import re
topic_message = 'This topic is a-ok'
# Option 1: Invalidate one char in string.
re1 = re.compile(r"[<>/{}[\]~`]");
if re1.search(topic_message):
print ("RE1: Invalid char detected.")
else:
print ("RE1: No invalid char detected.")
# Option 2: Validate all chars in string.
re2 = re.compile(r"^[^<>/{}[\]~`]*$");
if re2.match(topic_message):
print ("RE2: All chars are valid.")
else:
print ("RE2: Not all chars are valid.")
Take your pick.
Note: the original regex erroneously has a right square bracket in the character class which needs to be escaped.
Benchmarks: After seeing gnibbler's interesting solution using set(), I was curious to find out which of these methods would actually be fastest, so I decided to measure them. Here are the benchmark data and statements measured and the timeit result values:
Test data:
r"""
TEST topic_message STRINGS:
ok: 'This topic is A-ok. This topic is A-ok.'
bad: 'This topic is <not>-ok. This topic is {not}-ok.'
MEASURED PYTHON STATEMENTS:
Method 1: 're1.search(topic_message)'
Method 2: 're2.match(topic_message)'
Method 3: 'set(invalid_chars).intersection(topic_message)'
"""
Results:
r"""
Seconds to perform 1000000 Ok-match/Bad-no-match loops:
Method Ok-time Bad-time
1 1.054 1.190
2 1.830 1.636
3 4.364 4.577
"""
The benchmark tests show that Option 1 is slightly faster than option 2 and both are much faster than the set().intersection() method. This is true for strings which both match and don't match.
You have to be much more careful when using regular expressions - they are full of traps.
in the case of [^<>/\{}[]~] the first ] closes the group which is probably not what you intended. If you want to use ] in a group it has to be the first character after the [ eg []^<>/\{}[~]
simple test confirms this
>>> import re
>>> re.search("[[]]","]")
>>> re.search("[][]","]")
<_sre.SRE_Match object at 0xb7883db0>
regex is overkill for this problem anyway
def clean_topic_message(self):
topic_message = self.cleaned_data['topic_message']
invalid_chars = '^<>/\{}[]~`$'
if (topic_message == ""):
raise forms.ValidationError(_(u'Please provide a message for your topic'))
if set(invalid_chars).intersection(topic_message):
raise forms.ValidationError(_(u'Topic message cannot contain the following: %s'%invalid_chars))
return topic_message
If efficiency is a major concern I would re.compile() the re string, since you're going to use the same regex many times.
re.match and re.search behave differently. Splitting words is not required to search using regular expressions.
import re
symbols_re = re.compile(r"[^<>/\{}[]~`]");
if symbols_re.search(self.cleaned_data('topic_message')):
//raise Validation error
I can't say what would be more efficient, but you certainly should get rid of the $ (unless it's an invalid character for the message)... right now you only match the re if the characters are at the end of topic_message because $ anchors the match to the right-hand side of the line.
In any case you need to scan the entire message. So wouldn't something simple like this work ?
def checkMessage(topic_message):
for char in topic_message:
if char in "<>/\{}[]~`":
return False
return True
is_valid = not any(k in text for k in '<>/{}[]~`')
I agree with gnibbler, regex is an overkiller for this situation. Probably after removing this unwanted chars you'll want to remove unwanted words also, here's a little basic way to do it:
def remove_bad_words(title):
'''Helper to remove bad words from a sentence based in a dictionary of words.
'''
word_list = title.split(' ')
for word in word_list:
if word in BAD_WORDS: # BAD_WORDS is a list of unwanted words
word_list.remove(word)
#let's build the string again
title2 = u''
for word in word_list:
title2 = ('%s %s') % (title2, word)
#title2 = title2 + u' '+ word
return title2
Example: just tailor to your needs.
### valid chars: 0-9 , a-z, A-Z only
import re
REGEX_FOR_INVALID_CHARS=re.compile( r'[^0-9a-zA-Z]+' )
list_of_invalid_chars_found=REGEX_FOR_INVALID_CHARS.findall( topic_message )