I have a lot of data which contain the word "not".
For example : "not good".
I want to convert "not good" to "notgood" (without a space).
How can I convert all of the "not"s in the data, erasing the space after "not"?
For example in the below list:
1. I am not beautiful → I am notbeautiful
2. She is not good to be a teacher → She is notgood to be a teacher
3. If I choose A, I think it's not bad decision → If I choose A, I think it's notbad decision
A simple way to do this would be to replace "not " (with a trailing space) with "not", removing the space.
text = "I am not beautiful"
new_text = text.replace("not ", "not")
print(new_text)
Will output:
I am notbeautiful
I suggest that you use a regular expression that matches the word with a boundary, in order to avoid matching phrases like "tying the knot with someone":
import re
output = re.sub(r'(?<=\bnot)\s+(?=\w+)', '', text)
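The word-boundary idea above can be sketched as a small helper and run against the examples from the question. The name join_not is made up for illustration, and the sketch assumes lowercase "not" only (a capitalized "Not" is left untouched). It consumes "not" and writes it back rather than using a lookbehind, which should give the same result.

```python
import re

def join_not(text):
    # \bnot matches "not" only at the start of a word, so "knot" is left alone.
    # \s+ is the whitespace to drop; (?=\w) requires a word to follow.
    return re.sub(r'\bnot\s+(?=\w)', 'not', text)

print(join_not("I am not beautiful"))
print(join_not("tying the knot with someone"))
```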
OR:
s = "I am not beautiful"
news=''.join(i+' ' if i != 'not' else i for i in s.split())
print(news)
Output:
I am notbeautiful
If you care about the space at the end do:
print(news.rstrip())
I have a text as follows.
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
I want to convert it to lowercase, except for the words that have _ABB in them.
So, my output should look as follows.
mytext = "this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday"
My current code is as follows.
splits = mytext.split()
newtext = []
for item in splits:
    if not '_ABB' in item:
        item = item.lower()
        newtext.append(item)
    else:
        newtext.append(item)
However, I want to know if there is any easy way of doing this, possibly in one line?
You can use a one liner splitting the string into words, check the words with str.endswith() and then join the words back together:
' '.join(w if w.endswith('_ABB') else w.lower() for w in mytext.split())
# 'this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday'
Of course use the in operator rather than str.endswith() if '_ABB' can actually occur anywhere in the word and not just at the end.
Extended regex approach:
import re
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
result = re.sub(r'\b((?!_ABB)\S)+\b', lambda m: m.group().lower(), mytext)
print(result)
The output:
this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday
Details:
\b - word boundary
(?!_ABB) - negative lookahead assertion, ensures that the text at the current position does not start with _ABB
\S - non-whitespace character
\b((?!_ABB)\S)+\b - the whole pattern matches a word NOT containing substring _ABB
Here is another possible (not elegant) one-liner:
mytext = "This is AVGs_ABB and NMN_ABB and most importantly GFD_ABB This is so important that you have to CLEAN the lab everyday"
print(' '.join(map(lambda x : x if '_ABB' in x else x.lower(), mytext.split())))
Which Outputs:
this is AVGs_ABB and NMN_ABB and most importantly GFD_ABB this is so important that you have to clean the lab everyday
Note: This assumes that the words in your text are separated only by spaces, so split() suffices here. If your text includes punctuation such as ",!.", you will need to use a regex instead to split up the words.
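For text with punctuation or irregular spacing, one possible sketch is to let re.sub visit each run of non-whitespace characters and lowercase it in place. The helper name lower_except and its marker parameter are made up for illustration. Treating any non-whitespace run as a "word" is an assumption, so punctuation stays attached to its word and the original whitespace is preserved.

```python
import re

def lower_except(text, marker="_ABB"):
    def fix(match):
        word = match.group()
        # Keep words containing the marker as they are; lowercase the rest.
        return word if marker in word else word.lower()
    # \S+ visits each run of non-whitespace; whitespace between runs is untouched.
    return re.sub(r'\S+', fix, text)

print(lower_except("This is AVGs_ABB, and  CLEAN the Lab!"))
```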
I am trying to remove all the single characters in a string
input: "This is a big car and it has a spacious seats"
my output should be:
output: "This is big car and it has spacious seats"
Here I am using the expression
import re
re.compile('\b(?<=)[a-z](?=)\b')
This matches with first single character in the string ...
Any help would be appreciated ...thanks in Advance
Edit: I have just seen that this was suggested in the comments first by Wiktor Stribiżew. Credit to him - I had not seen when this was posted.
You can also use re.sub() to automatically remove single characters (assuming you only want to remove alphabetical characters). The following will replace any occurrences of a single alphabetical character:
import re
text = "This is a big car and it has a spacious seats"
output = re.sub(r"\b[a-zA-Z]\b", "", text)
print(output)
# This is  big car and it has  spacious seats  (each removed letter leaves a doubled space)
You can learn more about inputting regex expression when replacing strings here: How to input a regex in string.replace?
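Removing a letter with re.sub() leaves its surrounding spaces behind, so a second pass to collapse whitespace may be wanted. A minimal sketch, assuming only ASCII letters count as single characters and that collapsing every whitespace run to one space is acceptable; the name drop_single_letters is made up for illustration. Note that \b also treats the "s" in "it's" as a standalone letter, so text with apostrophes would need a stricter pattern.

```python
import re

def drop_single_letters(text):
    # Remove standalone ASCII letters such as "a" or "I".
    stripped = re.sub(r'\b[a-zA-Z]\b', '', text)
    # Collapse the whitespace runs left behind and trim both ends.
    return re.sub(r'\s+', ' ', stripped).strip()

print(drop_single_letters("This is a big car and it has a spacious seats"))
```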
Here's one way to do it by splitting the string and filtering out single length letters using len and str.isalpha:
>>> s = "1 . This is a big car and it has a spacious seats"
>>> ' '.join(i for i in s.split() if not (i.isalpha() and len(i)==1))
'1 . This is big car and it has spacious seats'
re.sub(r' \w{1} |^\w{1} | \w{1}$', ' ', input)
EDIT:
You can use:
import re
input_string = "This is a big car and it has a spacious seats"
str_without_single_chars = re.sub(r'(?:^| )\w(?:$| )', ' ', input_string).strip()
or (which as was brought to my attention, doesn't meet the specifications):
input_string = "This is a big car and it has a spacious seats"
' '.join(w for w in input_string.split() if len(w)>3)
A quick way to remove words, characters, strings or anything else between two known tags or two known characters in a string is to use re directly: turn the delimiters into comment markers, then strip the comment, as shown below.
var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
#And finally
var = re.sub('<!--.*?-->', '', var)
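It removes everything between the tags without needing Beautiful Soup. Note that . does not match newlines unless you pass flags=re.DOTALL, so as written it only handles blocks that sit on one line.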
Batch files are where the "" got their beginnings; they were only borrowed for use with batch and HTML from native C. Python's regular expressions have not changed much from regular expressions elsewhere, so why iterate many times when a single pass can find it all as one chunk? You can do the same with individual characters.
var = re.sub(r'\[', '<!--', var)
var = re.sub(r'\]', '-->', var)
And finally
var = re.sub('<!--.*?-->', '', var)  # wipes out the markers along with everything between them
And you do not need Beautiful Soup. You can also scrape data this way if you understand how it works.
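The same removal can also be sketched as a single substitution that matches the whole block at once. This is only an illustration, not a replacement for a real HTML parser: the name strip_scripts is made up, and the pattern assumes script blocks are not nested and that "</script>" never appears inside a script's own string literals.

```python
import re

# re.DOTALL lets "." cross newlines; re.IGNORECASE also accepts <SCRIPT>.
SCRIPT_BLOCK = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                          re.DOTALL | re.IGNORECASE)

def strip_scripts(html):
    # The lazy .*? stops at the first closing tag, so separate blocks
    # are removed one by one instead of as one giant match.
    return SCRIPT_BLOCK.sub('', html)

page = '<p>a</p><script type="text/javascript">\nvar x = 1;\n</script><p>b</p>'
print(strip_scripts(page))
```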
I asked a question a little while ago (Python splitting unknown string by spaces and parentheses) which worked great until I had to change my way of thinking. I have still not grasped regex so I need some help with this.
If the user types this:
new test (test1 test2 test3) test "test5 test6"
I would like it to look like the output to the variable like this:
["new", "test", "test1 test2 test3", "test", "test5 test6"]
In other words, if it is one word separated by a space then split it from the next word; if it is in parentheses then split out the whole group of words in the parentheses and remove the parentheses. The same goes for the quotation marks.
I currently am using this code which does not meet the above standard (From the answers in the link above):
>>> import re
>>> strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
>>> [", ".join(x.split()) for x in re.split(r'[()]',strs) if x.strip()]
['Hello', 'Test1, test2', 'Hello1, hello2', 'other_stuff']
This works well but there is a problem, if you have this:
strs = "Hello Test (Test1 test2) (Hello1 hello2) other_stuff"
It combines the Hello and Test as one split instead of two.
It also doesn't allow the use of parentheses and quotation marks splitting at the same time.
The answer was simply:
re.findall(r'\[[^\]]*\]|\([^\)]*\)|\"[^\"]*\"|\S+', strs)
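Note that a pattern like the one above returns the groups with their parentheses or quotes still attached. One possible variation, sketched here with a made-up helper split_groups, uses a capturing group per alternative so that only the inner text is kept. It assumes the delimiters are balanced and not nested, and it leaves out the square-bracket alternative.

```python
import re

# One capturing group per alternative: (...) contents, "..." contents, bare word.
TOKEN = re.compile(r'\(([^)]*)\)|"([^"]*)"|(\S+)')

def split_groups(s):
    # Exactly one group is non-None for every match; keep that one.
    return [next(g for g in m.groups() if g is not None)
            for m in TOKEN.finditer(s)]

print(split_groups('new test (test1 test2 test3) test "test5 test6"'))
```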
This is pushing what regexps can do. Consider using pyparsing instead. It does recursive descent. For this task, you could use:
from pyparsing import *
import string, re
RawWord = Word(re.sub('[()" ]', '', string.printable))
Token = Forward()
Token << ( RawWord |
           Group('"' + OneOrMore(RawWord) + '"') |
           Group('(' + OneOrMore(Token) + ')') )
Phrase = ZeroOrMore(Token)
Phrase.parseString(s, parseAll=True)
This is robust against strange whitespace and handles nested parentheticals. It's also a bit more readable than a large regexp, and therefore easier to tweak.
I realize you've long since solved your problem, but this is one of the highest google-ranked pages for problems like this, and pyparsing is an under-known library.
Your problem is not well defined.
Your description of the rules is
In other words if it is one word seperated by a space then split it
from the next word, if it is in parentheses then split the whole group
of words in the parentheses and remove them. Same goes for the commas.
I guess with commas you mean inverted commas == quotation marks.
Then with this
strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
you should get that
["Hello (Test1 test2) (Hello1 hello2) other_stuff"]
since everything is surrounded by inverted commas. Most probably, you want to work with no care of largest inverted commas.
I propose this, although it is a bit ugly:
import re, itertools
strs = input("enter a string list ")
print([y for y in itertools.chain(*[re.split(r'\"(.*)\"', x)
                                    for x in re.split(r'\((.*)\)', strs)])
       if y != ''])
gets
>>>
enter a string list here there (x y ) thereagain "there there"
['here there ', 'x y ', ' thereagain ', 'there there']
This is doing what you expect
import re, itertools
strs = input("enter a string list ")
res1 = [y for y in itertools.chain(*[re.split(r'\"(.*)\"', x)
                                     for x in re.split(r'\((.*)\)', strs)])
        if y != '']
set1 = re.search(r'\"(.*)\"', strs).groups()
set2 = re.search(r'\((.*)\)', strs).groups()
print([k for k in res1 if k in set1 or k in set2]
      + list(itertools.chain(*[k.split() for k in res1
                               if k not in set1 and k not in set2])))
For python 3.6 - 3.8
I had a similar question, but I like none of those answers, maybe because most of them are from 2013. So I worked out my own solution.
import re
regex = r'\(.+?\)|".+?"|\w+'
test = 'Hello Test (Test1 test2) (Hello1 hello2) other_stuff'
result = re.findall(regex, test)
Here you are looking for three different groups:
Something that is included inside (); the parentheses must be escaped with backslashes
Something that is included inside ""
Just words
The use of ? makes your search lazy instead of greedy
I have patterns like this:
" 1+2;\r\n\r(%o2) 3\r\n(%i3) "
I'd like to split them up into:
[" 1+2;","(%o2) 3","(%i3)"]
The regex for the first pattern is hard to construct, since it could be anything a user asks of an algebra system. The second could be:
'\(%o\d+\).'
and the last something like this:
'\(%i\d+\)'
I'm not stumped by the regex part as such, but by how to actually split once I know the correct pattern.
How would I split this?
How about splitting on (\r|\n)+?
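A minimal sketch of that suggestion, using the sample string from the question. The helper name split_lines is made up, and stripping each piece (which also drops the leading space of " 1+2;") is an assumption about what is wanted.

```python
import re

def split_lines(raw):
    # [\r\n]+ treats any run of CR/LF characters as a single separator.
    parts = re.split(r'[\r\n]+', raw)
    # Trim each piece and drop pieces that are empty after trimming.
    return [p.strip() for p in parts if p.strip()]

print(split_lines(" 1+2;\r\n\r(%o2) 3\r\n(%i3) "))
```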
Will this code work for you?
patterns = [p.strip() for p in " 1+2;\r\n\r(%o2) 3\r\n(%i3) ".split("\r\n")]
To clarify:
>>> patterns = " 1+2;\r\n\r(%o2) 3\r\n(%i3) ".split("\r\n")
>>> patterns
[' 1+2;', '\r(%o2) 3', '(%i3) ']
>>> patterns = [p.strip() for p in patterns]
>>> patterns
['1+2;', '(%o2) 3', '(%i3)']
This way you split the lines and get rid of unnecessary whitespace.
EDIT: Python strings also have a splitlines() method:
splitlines(...)
S.splitlines([keepends]) -> list of strings
Return a list of the lines in S, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends
is given and true.
So this code may be changed to:
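patterns = [p.strip() for p in " 1+2;\r\n\r(%o2) 3\r\n(%i3) ".splitlines() if p.strip()]
This also handles newlines without carriage returns and other combinations of line endings. The if p.strip() filter is needed because splitlines() treats the lone \r as a line break of its own, which produces an empty line.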
I am close but I am not sure what to do with the resulting match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end with a space. This is what I want. However, this returns a Match object that I don't know what to do with. What's the most computationally efficient way of finding and returning a string which is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would be returned
guy
The following regular expression does what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print(p.group(1))
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
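The cases listed above can be checked with a short sketch. The helper names first_hashtag and all_hashtags are made up for illustration. The sketch also guards against re.search() returning None when the text contains no # at all, which would otherwise raise an AttributeError on .group().

```python
import re

def first_hashtag(text):
    match = re.search(r'#(\w+)', text)
    # re.search returns None when nothing matches, so check before .group().
    return match.group(1) if match else None

def all_hashtags(text):
    # With a single capturing group, findall returns just the captured parts.
    return re.findall(r'#(\w+)', text)

print(first_hashtag("Hi there #guy, what's with the comma?"))
print(all_hashtags("#a and #b2, #c_d!"))
```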
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print(p.group(1)) # first thing matched inside of ( .. )
But as usual with regex, there are tons of examples that break this. For example, if the text is s = "Hi there #guy, what's with the comma?" the result would be guy,.
So you really need to think about every possible thing you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point, it literally only matches letters (a .. z, no unicode etc).
p.group(0) returns the entire text that was matched; to get only guy, put the part after the # in a capturing group and read it with p.group(1). If you want to find out what an object can do, you can call the built-in dir(p) function. This will return a list of the attributes and methods that are available on that object.
As is evident from the answers so far, regex is the most efficient solution for your problem. The answers differ slightly regarding which characters you allow to follow the #:
[^ ] anything but space
\w is equivalent to [A-Za-z0-9_] in Python 2 by default; in Python 3 it also matches Unicode letters and digits unless you pass re.ASCII
If you have a better idea of which characters might be included in the user name, you might adjust your regex to reflect that. For example, allowing only lowercase ASCII letters would be:
[a-z]
NB: I skipped quantifiers for simplicity.
(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> import re
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
You say: "If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space." But this is incorrect: that pattern is a character class which will match ONE character from the set #, /, ., * and space. Note: there's a redundant second / in the pattern.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.