python regex to replace all single word characters in string - python

I am trying to remove all the single characters in a string
input: "This is a big car and it has a spacious seats"
my output should be:
output: "This is big car and it has spacious seats"
Here I am using the expression
import re
re.compile('\b(?<=)[a-z](?=)\b')
This matches with first single character in the string ...
Any help would be appreciated ...thanks in Advance

Edit: I have just seen that this was suggested in the comments first by Wiktor Stribiżew. Credit to him - I had not seen when this was posted.
You can also use re.sub() to automatically remove single characters (assuming you only want to remove alphabetical characters). The following will replace any occurrences of a single alphabetical character:
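import re
text = "This is a big car and it has a spacious seats"
output = re.sub(r"\b[a-zA-Z]\b", "", text)
>>> output
'This is  big car and it has  spacious seats'
Note that the spaces around each removed letter are left behind; " ".join(output.split()) collapses them to give "This is big car and it has spacious seats".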
You can learn more about inputting regex expression when replacing strings here: How to input a regex in string.replace?
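One caveat with a word-boundary pattern is that it also fires inside tokens such as "it's", where the s has a boundary on both sides. A minimal sketch of a whitespace-delimited variant follows; the function name drop_single_letters is made up for illustration, and it assumes that "single character" means a lone ASCII letter surrounded by whitespace or the ends of the string.

```python
import re

def drop_single_letters(text):
    """Remove lone ASCII letters that stand as their own whitespace-delimited token."""
    # (?<!\S) and (?!\S) require whitespace (or the string edge) on each side,
    # so the "s" in "it's" is not treated as a separate one-letter word.
    without_letters = re.sub(r"(?<!\S)[A-Za-z](?!\S)", "", text)
    # Removing a token leaves its neighbouring spaces behind; collapse them.
    return " ".join(without_letters.split())

print(drop_single_letters("This is a big car and it has a spacious seats"))
# This is big car and it has spacious seats
print(drop_single_letters("it's a test"))
# it's test
```

Digits and punctuation such as "1" or "." are kept, which matches the behaviour of the str.isalpha filter.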

Here's one way to do it by splitting the string and filtering out single length letters using len and str.isalpha:
>>> s = "1 . This is a big car and it has a spacious seats"
>>> ' '.join(i for i in s.split() if not (i.isalpha() and len(i)==1))
'1 . This is big car and it has spacious seats'

re.sub(r' \w{1} |^\w{1} | \w{1}$', ' ', input)

EDIT:
You can use:
import re
input_string = "This is a big car and it has a spacious seats"
str_without_single_chars = re.sub(r'(?:^| )\w(?:$| )', ' ', input_string).strip()
or (which as was brought to my attention, doesn't meet the specifications):
input_string = "This is a big car and it has a spacious seats"
' '.join(w for w in input_string.split() if len(w) > 1)

A quick way to remove words, characters, strings or anything else between two known tags or two known characters in a string is to use re to turn the delimiters into comment markers and then delete the comment, as shown below.
var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
#And finally
var = re.sub('<!--.*?-->', '', var)
It removes everything and works faster, better and cleaner than Beautiful Soup.
Batch files are where the "" got their beginnings, and they were only borrowed for use with batch and HTML from native C. When using Pythonic methods with regular expressions, keep in mind that Python has not changed much from the regular expressions used by Machine Language, so why iterate many times when a single pass can find everything as one chunk? Do the same with individual characters:
var = re.sub(r'\[', '<!--', var)
var = re.sub(r'\]', '-->', var)
And finally
var = re.sub('<!--.*?-->', '', var)  # wipes out the markers along with everything between them
And you do not need Beautiful Soup. You can also scrape data this way if you understand how it works.
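The same idea can be written as a single substitution, without first converting the delimiters to comment markers. This is only a sketch: strip_between is a made-up helper name, and it assumes the markers are literal strings that do not nest. Regex-based handling of real HTML is fragile, so treat it as an illustration of the technique rather than a replacement for a parser.

```python
import re

def strip_between(text, start, end):
    """Delete every span from `start` through `end`, markers included."""
    # re.escape makes the markers literal; .*? is lazy, so each span stops at
    # the nearest end marker; re.DOTALL lets . match newlines as well.
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.sub(pattern, "", text, flags=re.DOTALL)

print(repr(strip_between("a <script>x\ny</script> b", "<script>", "</script>")))
# 'a  b'
print(repr(strip_between("keep [drop] keep", "[", "]")))
# 'keep  keep'
```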

Related

Convert negation text to text in python

I have a lot of data which contain the word "not".
For example : "not good".
I want to convert "not good" to "notgood" (without a space).
How can I convert all of the "not"s in the data, erasing the space after "not".
For example in the below list:
1. I am not beautiful → I am notbeautiful
2. She is not good to be a teacher → She is notgood to be a teacher
3. If I choose A, I think it's not bad decision → If I choose A, I think it's notbad decision
A simple way to do this would be to replace "not " (with its trailing space) with "not", removing the space.
text = "I am not beautiful"
new_text = text.replace("not ", "not")
print(new_text)
Will output:
I am notbeautiful
I suggest that you use a regular expression that matches the word with a boundary, in order to avoid matching phrases like "tying the knot with someone":
import re
output = re.sub(r'(?<=\bnot)\s+(?=\w+)', '', text)
OR:
s = "I am not beautiful"
news=''.join(i+' ' if i != 'not' else i for i in s.split())
print(news)
Output:
I am notbeautiful
If you care about the space at the end do:
print(news.rstrip())
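To compare the approaches on the three example sentences, here is a small sketch of the word-boundary version wrapped in a function. The name join_negation is hypothetical, and the match is case-sensitive, so a capitalised "Not" is left untouched unless re.IGNORECASE is added.

```python
import re

def join_negation(text):
    """Glue the standalone word 'not' to the word that follows it."""
    # \b before "not" rejects words that merely end in "not" (e.g. "knot");
    # \s+ is the whitespace being dropped; (?=\w) requires a word to follow.
    return re.sub(r"\bnot\s+(?=\w)", "not", text)

print(join_negation("I am not beautiful"))
# I am notbeautiful
print(join_negation("If I choose A, I think it's not bad decision"))
# If I choose A, I think it's notbad decision
print(join_negation("tying the knot with someone"))
# tying the knot with someone
```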

Regex re.findall() search to extract unit beginning with # and postcode

I am using Python 3.6 and trying to extract some building unit that starts with # in a string and some postcode using re.findall() (following explanation obtained here Extracting phone numbers from a free form text in python by using regex). I don't know exactly how the structure works and I do not get the result I am looking for.
Here is my code
string='Road #10-13, Tree 26739 #23.04 934047 Holiday'
re.findall(r'[#][0-9(\)][0-9 ,\.\-\(\)]{8,}[0-9 ,\(\)]', string)
Basically I would like to obtain something like
['#10-13,','#23.04 934047 ']
But, because there is a comma after #10-13, I only obtain:
['#23.04 934047 ']
What I want to change in my query is to say that the string has to end with a digit 0-9 OR ','. Even if I change the string and add a ',' after #23.04, I still get the same result.
Could someone also explain to me the meaning of {8,} ?
Your problem is not the comma. Your problem is that {8,} requires a match of 8 or more characters, and #10-13, has only 7 in total, 5 of them for that part. Changing it to {5,} makes it work:
>>> re.findall(r'[#][0-9(\)][0-9 ,\.\-\(\)]{5,}[0-9 ,\(\)]', string)
['#10-13, ', '#23.04 934047 ']
I would use a simpler approach though, not sure if it matches all your requirements but it certainly works here:
>>> re.findall(r'#[-,.\d ()]+', string)
['#10-13, ', '#23.04 934047 ']
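On the meaning of {8,}: it is a quantifier that says "at least 8 repetitions of the preceding item, with no upper limit". The sketch below reuses the simpler character class from this answer to show the effect of the minimum; the variable names are only illustrative.

```python
import re

s = "Road #10-13, Tree 26739 #23.04 934047 Holiday"

# {8,} demands at least 8 characters from the class after the '#'.
at_least_8 = re.findall(r"#[-,.\d ()]{8,}", s)
# {5,} lowers the minimum to 5, so the shorter unit qualifies too.
at_least_5 = re.findall(r"#[-,.\d ()]{5,}", s)

print(at_least_8)  # ['#23.04 934047 ']
print(at_least_5)  # ['#10-13, ', '#23.04 934047 ']
```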
You can use a look-ahead. ie, extract part of the string that starts with an # then followed by anything as long as there is a non word character(s) eg space or , that are immediately followed by letters
re.findall("#.+?(?=\\W+[A-Z])",string)
['#10-13', '#23.04 934047']
I feel the regex could be a lot simpler
string='Road #10-13, Tree 26739 #23.04 934047 Holiday'
re.findall(r'#[\d\- \.]+', string)
outputs:
['#10-13', '#23.04 934047 ']

Python - remove parts of a string

I have many fill-in-the-blank sentences in strings,
e.g. "6d) We took no [pains] to hide it ."
How can I efficiently parse this string (in Python) to be
"We took no to hide it"?
I also would like to be able to store the word in brackets (e.g. "pains") in a list for use later. I think the regex module could be better than Python string operations like split().
This will give you all the words inside the brackets.
import re
s="6d) We took no [pains] to hide it ."
matches = re.findall(r'\[(.*?)\]', s)
Then you can run this to remove all bracketed words.
re.sub(r'\[(.*?)\]', '', s)
Just for fun (to do the gathering and the substitution in one pass):
matches = []
def subber(m):
    matches.append(m.group(1))  # remember the bracketed word
    return ""                   # and replace the whole match with nothing
new_text = re.sub(r"\[(.*?)\]", subber, s)
print(new_text)
print(matches)
import re
s = 'this is [test] string'
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print(m.group(1))
Output
test
For your example you could use this regex:
(.*\))(.+)\[(.+)\](.+)
You will get four groups that you can use to create your resulting string, and you can save the third group for later use:
6d)
We took no
pains
to hide it .
I used .+ here because I don't know if your strings always look like your example. You can change the .+ to an alphanumeric class or something more specific to your case.
import re
s = '6d) We took no [pains] to hide it .'
m = re.search(r"(.*\))(.+)\[(.+)\](.+)", s)
print(m.group(2) + m.group(4)) # "We took no to hide it ."
print(m.group(3)) # pains
import re
m = re.search(r".*\) (.*)\[.*\] (.*)", "6d) We took no [pains] to hide it .")
if m:
    g = m.groups()
    print(g[0] + g[1])
Output :
We took no to hide it .
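Putting the pieces from the answers above together, a minimal sketch that returns both the cleaned sentence and the stored words might look like this. The name split_blank is hypothetical, and it assumes the leading label (such as "6d)") is a short run of letters or digits followed by ")" at the very start of the string.

```python
import re

def split_blank(sentence):
    """Return (sentence without bracketed words, list of bracketed words)."""
    words = re.findall(r"\[(.*?)\]", sentence)
    # Drop each bracketed word together with the whitespace in front of it,
    # so no double space is left behind.
    stripped = re.sub(r"\s*\[.*?\]", "", sentence)
    # Assumption: the label is word characters plus ")" at the string start.
    stripped = re.sub(r"^\w+\)\s*", "", stripped)
    return stripped, words

print(split_blank("6d) We took no [pains] to hide it ."))
# ('We took no to hide it .', ['pains'])
```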

Python splitting string by parentheses

I asked a question a little while ago (Python splitting unknown string by spaces and parentheses) which worked great until I had to change my way of thinking. I have still not grasped regex so I need some help with this.
If the user types this:
new test (test1 test2 test3) test "test5 test6"
I would like it to look like the output to the variable like this:
["new", "test", "test1 test2 test3", "test", "test5 test6"]
In other words, if it is one word separated by a space then split it from the next word; if it is in parentheses then split off the whole group of words in the parentheses and remove the parentheses. The same goes for the quotation marks.
I currently am using this code which does not meet the above standard (From the answers in the link above):
>>> import re
>>> strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
>>> [", ".join(x.split()) for x in re.split(r'[()]',strs) if x.strip()]
['Hello', 'Test1, test2', 'Hello1, hello2', 'other_stuff']
This works well but there is a problem, if you have this:
strs = "Hello Test (Test1 test2) (Hello1 hello2) other_stuff"
It combines the Hello and Test as one split instead of two.
It also doesn't allow the use of parentheses and quotation marks splitting at the same time.
The answer was simply:
re.findall(r'\[[^\]]*\]|\([^)]*\)|"[^"]*"|\S+', strs)
This is pushing what regexps can do. Consider using pyparsing instead. It does recursive descent. For this task, you could use:
from pyparsing import *
import string, re
RawWord = Word(re.sub('[()" ]', '', string.printable))
Token = Forward()
Token << ( RawWord |
Group('"' + OneOrMore(RawWord) + '"') |
Group('(' + OneOrMore(Token) + ')') )
Phrase = ZeroOrMore(Token)
Phrase.parseString(s, parseAll=True)
This is robust against strange whitespace and handles nested parentheticals. It's also a bit more readable than a large regexp, and therefore easier to tweak.
I realize you've long since solved your problem, but this is one of the highest google-ranked pages for problems like this, and pyparsing is an under-known library.
Your problem is not well defined.
Your description of the rules is
In other words if it is one word seperated by a space then split it
from the next word, if it is in parentheses then split the whole group
of words in the parentheses and remove them. Same goes for the commas.
I guess with commas you mean inverted commas == quotation marks.
Then with this
strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
you should get that
["Hello (Test1 test2) (Hello1 hello2) other_stuff"]
since everything is surrounded by inverted commas. Most probably, you want to work with no care of largest inverted commas.
I propose this, although it is a bit ugly:
import re, itertools
strs = input("enter a string list ")
print([y for y in itertools.chain(*[re.split(r'\"(.*)\"', x)
                                    for x in re.split(r'\((.*)\)', strs)])
       if y != ''])
gets
>>>
enter a string list here there (x y ) thereagain "there there"
['here there ', 'x y ', ' thereagain ', 'there there']
This is doing what you expect
import re, itertools
strs = input("enter a string list ")
res1 = [y for y in itertools.chain(*[re.split(r'\"(.*)\"', x)
                                     for x in re.split(r'\((.*)\)', strs)])
        if y != '']
set1 = re.search(r'\"(.*)\"', strs).groups()
set2 = re.search(r'\((.*)\)', strs).groups()
print([k for k in res1 if k in set1 or k in set2]
      + list(itertools.chain(*[k.split() for k in res1
                               if k not in set1 and k not in set2])))
For python 3.6 - 3.8
I had a similar question, but I liked none of those answers, maybe because most of them are from 2013. So I worked out my own solution.
import re
regex = r'\(.+?\)|".+?"|\w+'
test = 'Hello Test (Test1 test2) (Hello1 hello2) other_stuff'
result = re.findall(regex, test)
Here you are looking for three different groups:
Something enclosed in (); the parentheses must be escaped with backslashes
Something enclosed in ""
Plain words
The ? makes the search lazy instead of greedy
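The patterns above keep the parentheses and quotation marks in the results, while the question asks for them to be removed. One way to get that, sketched here under the assumption that groups are never nested and are separated from neighbouring words by spaces, is to capture the inside of each delimiter pair. TOKEN and tokenize are names invented for this example.

```python
import re

# Three alternatives, each with its own capture group: the inside of (...),
# the inside of "...", or a plain run of non-space characters.
TOKEN = re.compile(r'\(([^)]*)\)|"([^"]*)"|(\S+)')

def tokenize(text):
    """Split on spaces, keeping (...) and "..." groups whole, delimiters removed."""
    tokens = []
    for paren, quoted, word in TOKEN.findall(text):
        # Only one alternative matches at a time; the other groups are empty.
        tokens.append(paren or quoted or word)
    return tokens

print(tokenize('new test (test1 test2 test3) test "test5 test6"'))
# ['new', 'test', 'test1 test2 test3', 'test', 'test5 test6']
```

For nested parentheses a real parser such as pyparsing remains the safer choice.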

Split string from a regex in python re

I have patterns like this:
" 1+2;\r\n\r(%o2) 3\r\n(%i3) "
I'd like to split them up into:
[" 1+2;","(%o2) 3","(%i3)"]
The regex for the first pattern is hard to construct since it could be anything a user asks of an algebra system. The second could be:
'\(%o\d+\).'
and the last something like this:
'\(%i\d+\)'
I'm not stumped by the regex part as such, but by how to actually split once I know the correct pattern.
How would I split this?
How about splitting on (\r|\n)+?
Will this code work for you?
patterns = [p.strip() for p in " 1+2;\r\n\r(%o2) 3\r\n(%i3) ".split("\r\n")]
To clarify:
>>> patterns = " 1+2;\r\n\r(%o2) 3\r\n(%i3) ".split("\r\n")
>>> patterns
[' 1+2;', '\r(%o2) 3', '(%i3) ']
>>> patterns = [p.strip() for p in patterns]
>>> patterns
['1+2;', '(%o2) 3', '(%i3)']
This way you split the lines and get rid of unnecessary whitespace characters.
EDIT: also: Python String has also splitlines() method:
splitlines(...)
S.splitlines([keepends]) -> list of strings
Return a list of the lines in S, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends
is given and true.
So this code may be changed to:
patterns = [p.strip() for p in " 1+2;\r\n\r(%o2) 3\r\n(%i3) ".splitlines() if p.strip()]
This may possibly answer the problem with NL's without CR's and all different combinations.
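The earlier suggestion of splitting on (\r|\n)+ can be sketched with re.split. Using the character class [\r\n]+ treats any run of carriage returns and line feeds as one separator, which avoids the stray empty or \r-prefixed entries. Note that strip() also removes the leading space of " 1+2;", which the question's expected output kept.

```python
import re

raw = " 1+2;\r\n\r(%o2) 3\r\n(%i3) "

# Split on runs of CR/LF, trim each piece, and skip pieces that are empty.
parts = [p.strip() for p in re.split(r"[\r\n]+", raw) if p.strip()]

print(parts)
# ['1+2;', '(%o2) 3', '(%i3)']
```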
