I asked a question a little while ago (Python splitting unknown string by spaces and parentheses) which worked great until I had to change my way of thinking. I have still not grasped regex so I need some help with this.
If the user types this:
new test (test1 test2 test3) test "test5 test6"
I would like the result stored in the variable to look like this:
["new", "test", "test1 test2 test3", "test", "test5 test6"]
In other words, if it is a single word separated by a space, split it from the next word; if it is in parentheses, keep the whole group of words in the parentheses as one item and remove the parentheses. The same goes for the quotation marks.
I currently am using this code which does not meet the above standard (From the answers in the link above):
>>> import re
>>> strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
>>> [", ".join(x.split()) for x in re.split(r'[()]',strs) if x.strip()]
['Hello', 'Test1, test2', 'Hello1, hello2', 'other_stuff']
This works well but there is a problem, if you have this:
strs = "Hello Test (Test1 test2) (Hello1 hello2) other_stuff"
It combines the Hello and Test as one split instead of two.
It also doesn't allow the use of parentheses and quotation marks splitting at the same time.
The answer was simply:
re.findall(r'\[[^\]]*\]|\([^\)]*\)|\"[^\"]*\"|\S+',strs)
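As a sketch of how that pattern could be adapted to drop the delimiters as well, each alternative can get its own capture group. The name tokenize is my own, and this assumes parentheses and quotes are never nested or left unclosed:

```python
import re

# One capture group per alternative: (...) contents, "..." contents, or a bare word.
TOKEN_RE = re.compile(r'\(([^)]*)\)|"([^"]*)"|(\S+)')

def tokenize(text):
    """Split on whitespace, keeping (...) and "..." groups whole, without delimiters."""
    # Only one group takes part in each match; lastindex is that group's number.
    return [m.group(m.lastindex) for m in TOKEN_RE.finditer(text)]

print(tokenize('new test (test1 test2 test3) test "test5 test6"'))
# ['new', 'test', 'test1 test2 test3', 'test', 'test5 test6']
```

Because the bare-word alternative comes last, a parenthesised or quoted group is always tried first at each position.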
This is pushing what regexps can do. Consider using pyparsing instead. It does recursive descent. For this task, you could use:
from pyparsing import *
import string, re
RawWord = Word(re.sub('[()" ]', '', string.printable))
Token = Forward()
Token << ( RawWord |
           Group('"' + OneOrMore(RawWord) + '"') |
           Group('(' + OneOrMore(Token) + ')') )
Phrase = ZeroOrMore(Token)
Phrase.parseString(s, parseAll=True)
This is robust against strange whitespace and handles nested parentheticals. It's also a bit more readable than a large regexp, and therefore easier to tweak.
I realize you've long since solved your problem, but this is one of the highest google-ranked pages for problems like this, and pyparsing is an under-known library.
Your problem is not well defined.
Your description of the rules is
In other words if it is one word seperated by a space then split it
from the next word, if it is in parentheses then split the whole group
of words in the parentheses and remove them. Same goes for the commas.
I guess with commas you mean inverted commas == quotation marks.
Then with this
strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
you should get that
["Hello (Test1 test2) (Hello1 hello2) other_stuff"]
since everything is surrounded by inverted commas. Most probably you want to ignore the outermost inverted commas.
I propose this, although it is a bit ugly
import re, itertools
strs = input("enter a string list ")
# Split on (...) first, then split each piece on "..."; drop empty strings.
print([y for y in itertools.chain(*[re.split(r'\"(.*)\"', x)
                                    for x in re.split(r'\((.*)\)', strs)])
       if y != ''])
gets
>>>
enter a string list here there (x y ) thereagain "there there"
['here there ', 'x y ', ' thereagain ', 'there there']
This is doing what you expect
import re, itertools
strs = input("enter a string list ")
res1 = [y for y in itertools.chain(*[re.split(r'\"(.*)\"', x)
                                     for x in re.split(r'\((.*)\)', strs)])
        if y != '']
set1 = re.search(r'\"(.*)\"', strs).groups()
set2 = re.search(r'\((.*)\)', strs).groups()
# Keep the quoted and parenthesised groups whole; split everything else on whitespace.
print([k for k in res1 if k in set1 or k in set2]
      + list(itertools.chain(*[k.split() for k in res1
                               if k not in set1 and k not in set2])))
For python 3.6 - 3.8
I had a similar question, however I like none of those answers, maybe because most of them are from 2013. So I elaborated my own solution.
import re
regex = r'\(.+?\)|".+?"|\w+'
test = 'Hello Test (Test1 test2) (Hello1 hello2) other_stuff'
result = re.findall(regex, test)
Here you are looking for three different groups:
Something that is included inside (); the parentheses must be escaped with backslashes in the pattern
Something that is included inside ""
Just words
The use of ? makes your search lazy instead of greedy
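To see what the lazy quantifier changes, here is a small comparison on the same test string (only the parenthesised alternative is shown, to keep it short):

```python
import re

test = 'Hello Test (Test1 test2) (Hello1 hello2) other_stuff'

greedy = re.findall(r'\(.+\)', test)   # runs on to the LAST closing parenthesis
lazy = re.findall(r'\(.+?\)', test)    # stops at the FIRST closing parenthesis

print(greedy)  # ['(Test1 test2) (Hello1 hello2)']
print(lazy)    # ['(Test1 test2)', '(Hello1 hello2)']
```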
I am trying to remove all the single characters in a string
input: "This is a big car and it has a spacious seats"
my output should be:
output: "This is big car and it has spacious seats"
Here I am using the expression
import re
re.compile('\b(?<=)[a-z](?=)\b')
This matches with first single character in the string ...
Any help would be appreciated ...thanks in Advance
Edit: I have just seen that this was suggested in the comments first by Wiktor Stribiżew. Credit to him - I had not seen when this was posted.
You can also use re.sub() to automatically remove single characters (assuming you only want to remove alphabetical characters). The following will replace any occurrences of a single alphabetical character:
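import re
input = "This is a big car and it has a spacious seats"
output = re.sub(r"\b[a-zA-Z]\b", "", input)
>>> output
'This is  big car and it has  spacious seats'
Note that the space on each side of a removed letter is kept, so the result contains double spaces; collapse them afterwards if that matters.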
You can learn more about inputting regex expression when replacing strings here: How to input a regex in string.replace?
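A minimal sketch that also tidies the whitespace left behind, so the result matches the output asked for. The function name is my own, and note that \b treats an apostrophe as a word boundary, so the "s" in "it's" would be removed too:

```python
import re

def remove_single_letters(text):
    """Drop standalone single letters, then normalise the leftover whitespace."""
    without_letters = re.sub(r"\b[a-zA-Z]\b", "", text)
    # split() with no arguments discards runs of whitespace; join() rebuilds the sentence.
    return " ".join(without_letters.split())

print(remove_single_letters("This is a big car and it has a spacious seats"))
# This is big car and it has spacious seats
```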
Here's one way to do it by splitting the string and filtering out single length letters using len and str.isalpha:
>>> s = "1 . This is a big car and it has a spacious seats"
>>> ' '.join(i for i in s.split() if not (i.isalpha() and len(i)==1))
'1 . This is big car and it has spacious seats'
re.sub(r' \w{1} |^\w{1} | \w{1}$', ' ', input)
EDIT:
You can use:
import re
input_string = "This is a big car and it has a spacious seats"
str_without_single_chars = re.sub(r'(?:^| )\w(?:$| )', ' ', input_string).strip()
or (which, as was brought to my attention, doesn't meet the specification, because it also drops two- and three-letter words):
input_string = "This is a big car and it has a spacious seats"
' '.join(w for w in input_string.split() if len(w)>3)
The fastest way to remove words, characters, strings, or anything else between two known tags or two known characters in a string is a direct, native-C-style approach: use re to turn the tags into comment markers, then remove the comment, as shown below.
var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
#And finally
var = re.sub('<!--.*?-->', '', var)
It removes everything and works faster, better and cleaner than Beautiful Soup.
Batch files are where the "" got their beginnings, and they were only borrowed for use with batch and HTML from native C. When using Pythonic methods with regular expressions, you have to realize that Python has not changed much from the regular expressions used by machine language, so why iterate many times when a single pass can find it all as one chunk? Do the same with individual characters:
var = re.sub(r'\[', '<!--', var)
var = re.sub(r'\]', '-->', var)
And finally
var = re.sub('<!--.*?-->', '', var)  # wipes out everything in between, markers included
And you do not need Beautiful Soup. You can also scalp data using them if you understand how this works.
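The same idea can be sketched in one step, without the intermediate comment markers. The helper name strip_between is my own; it assumes the markers are not nested, and it passes re.DOTALL because "." does not match newlines by default, so a multi-line block would otherwise be left in place:

```python
import re

def strip_between(text, start, end):
    """Remove every start...end span, markers included (non-nested markers assumed)."""
    # re.escape lets literal markers such as "[" be used safely in the pattern.
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.sub(pattern, "", text, flags=re.DOTALL)

print(strip_between("a<script>x\ny</script>b", "<script>", "</script>"))  # ab
print(strip_between("keep [drop] keep", "[", "]"))                        # keep  keep
```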
Let's say I have a string that looks like this:
myStr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
What I would like to obtain in the end would be:
myStr_l1 = '(Txt_l1) or (Txt2_l1)'
and
myStr_l2 = '(Txt_l2) or (Txt2_l2)'
Some properties:
all "Txt_"-elements of the string start with an uppercase letter
the string can contain much more elements (so there could also be Txt3, Txt4,...)
the suffixes '_l1' and '_l2' look different in reality; they cannot be used for matching (I chose them for demonstration purposes)
I found a way to get the first part done by using:
myStr_l1 = re.sub('\(\w+\)','',myStr)
which gives me
'(Txt_l1 ) or (Txt2_l1 )'
However, I don't know how to obtain myStr_l2. My idea was to remove everything between two open parentheses. But when I do something like this:
re.sub('\(w+\(', '', myStr)
the entire string is returned.
re.sub('\(.*\(', '', myStr)
removes - of course - far too much and gives me
'Txt2_l2))'
Does anyone have an idea how to get myStr_l2?
When there is an "and" instead of an "or", the strings look slightly different:
myStr2 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2))'
Then I can still use the command from above:
re.sub('\(\w+\)','',myStr2)
which gives:
'(Txt_l1 and Txt2_l1 )'
but I again fail to get myStr2_l2. How would I do this for this kind of string?
And how would one then do this for mixed expressions with "and" and "or" e.g. like this:
myStr3 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2)) or (Txt3_l1 (Txt3_l2) and Txt4_l1 (Txt2_l2))'
re.sub('\(\w+\)','',myStr3)
gives me
'(Txt_l1 and Txt2_l1 ) or (Txt3_l1 and Txt4_l1 )'
but again: How would I obtain myStr3_l2?
Regexp is not powerful enough for nested expressions (in your case: nested elements in parentheses). You will have to write a parser. Look at https://pyparsing.wikispaces.com/
I'm not entirely sure what you want, but I wrote this to strip out everything between the parentheses.
import re
mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
sets = mystr.split(' or ')
noParens = []
for line in sets:
    # group(1): the outer element, group(2): the inner element plus closing parentheses
    mat = re.match(r'\((.* )\((.*\)\))', line, re.M)
    if mat:
        noParens.append(mat.group(1))
        noParens.append(mat.group(2).replace(')',''))
print(noParens)
This takes all the parentheses away and puts your elements in a list. Here's an alternate way of doing it without using regular expressions.
mystr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
noParens = []
mystr = mystr.replace(' or ', ' ')
mystr = mystr.replace(')','')
mystr = mystr.replace('(','')
noParens = mystr.split()
print(noParens)
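For the fixed two-level shape shown in the question (arbitrary nesting depth would still call for a parser), a sketch with two substitutions seems to be enough. The function names level1 and level2 are my own, and the patterns assume each element is a single \w+ word followed by one space and its parenthesised inner element:

```python
import re

def level1(expr):
    # Drop each " (inner)" group, keeping the outer names.
    return re.sub(r' \(\w+\)', '', expr)

def level2(expr):
    # Replace each "outer (inner)" pair with just "inner".
    return re.sub(r'\w+ \((\w+)\)', r'\1', expr)

myStr = '(Txt_l1 (Txt_l2)) or (Txt2_l1 (Txt2_l2))'
myStr2 = '(Txt_l1 (Txt_l2) and Txt2_l1 (Txt2_l2))'

print(level1(myStr))   # (Txt_l1) or (Txt2_l1)
print(level2(myStr))   # (Txt_l2) or (Txt2_l2)
print(level2(myStr2))  # (Txt_l2 and Txt2_l2)
```

The connectives "and" and "or" are left alone because they are never directly followed by a complete "(word)" group.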
I have many fill-in-the-blank sentences in strings,
e.g. "6d) We took no [pains] to hide it ."
How can I efficiently parse this string (in Python) to be
"We took no to hide it"?
I also would like to be able to store the word in brackets (e.g. "pains") in a list for use later. I think the regex module could be better than Python string operations like split().
This will give you all the words inside the brackets.
import re
s="6d) We took no [pains] to hide it ."
matches = re.findall(r'\[(.*?)\]', s)
Then you can run this to remove all bracketed words.
re.sub(r'\[(.*?)\]', '', s)
just for fun (to do the gather and substitution in one iteration)
matches = []
def subber(m):
    # Remember the bracketed word, then replace the whole match with nothing.
    matches.append(m.group(1))
    return ""
new_text = re.sub(r"\[(.*?)\]", subber, s)
print(new_text)
print(matches)
import re
s = 'this is [test] string'
m = re.search(r"\[([A-Za-z0-9_]+)\]", s)
print(m.group(1))
Output
test
For your example you could use this regex:
(.*\))(.+)\[(.+)\](.+)
You will get four groups that you can use to build the resulting string, and you can save the third group for later use:
6d)
We took no
pains
to hide it .
I used .+ here because I don't know whether your strings always look like your example. You can change the .+ to an alphanumeric class or something more specific to your case.
import re
s = '6d) We took no [pains] to hide it .'
m = re.search(r"(.*\))(.+)\[(.+)\](.+)", s)
print(m.group(2) + m.group(4)) # "We took no to hide it ."
print(m.group(3)) # pains
import re
m = re.search(r".*\) (.*)\[.*\] (.*)", "6d) We took no [pains] to hide it .")
if m:
    g = m.groups()
    print(g[0] + g[1])
Output :
We took no to hide it .
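Putting the pieces from the answers above together, a sketch that returns both the cleaned sentence and the list of removed words. The function name is my own, and it assumes the leading label always looks like "6d)":

```python
import re

def extract_blanks(sentence):
    """Return (sentence without label and [words], list of the bracketed words)."""
    words = re.findall(r"\[(.*?)\]", sentence)
    text = re.sub(r"\[.*?\]", "", sentence)   # remove the bracketed words
    text = re.sub(r"^\w+\)\s*", "", text)     # remove a leading label such as "6d) "
    return " ".join(text.split()), words      # collapse the leftover double space

print(extract_blanks("6d) We took no [pains] to hide it ."))
# ('We took no to hide it .', ['pains'])
```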
I am trying to write code that will take a string and remove specific data from it. I know that the data will look like the line below, and I only need the data within the " " marks, not the marks themselves.
inputString = 'type="NN" span="123..145" confidence="1.0" '
Is there a way to extract the substring between two known characters, using them as the start and stop points?
You can extract all the text between pairs of " characters using regular expressions:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
pat=re.compile('"([^"]*)"')
strings=[]  # collected matches
while True:
    mat=pat.search(inputString)
    if mat is None:
        break
    strings.append(mat.group(1))
    # Continue searching after the end of this match.
    inputString=inputString[mat.end():]
print(strings)
or, easier:
import re
inputString='type="NN" span="123..145" confidence="1.0" '
strings=re.findall('"([^"]*)"', inputString)
print(strings)
Output for both versions:
['NN', '123..145', '1.0']
fields = inputString.split('"')
print(fields[1], fields[3], fields[5])
You could split the string at each space to get a list of 'key="value"' substrings and then use regular expressions to parse the substrings.
Using your input string:
>>> input_string = 'type="NN" span="123..145" confidence="1.0" '
>>> input_string_split = input_string.split()
>>> print(input_string_split)
['type="NN"', 'span="123..145"', 'confidence="1.0"']
Then use regular expressions:
>>> import re
>>> pattern = r'"([^"]+)"'
>>> for substring in input_string_split:
...     match_obj = re.search(pattern, substring)
...     print(match_obj.group(1))
...
NN
123..145
1.0
The regular expression '"([^"]+)"' matches anything within quotation marks (provided there is at least one character). The round brackets indicate the bit of the regular expression that you are interested in.
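If the keys are wanted as well, a sketch with two capture groups can build a dictionary in one pass. The variable names are my own, and it assumes every key is a plain \w+ word directly followed by =" with no escaped quotes inside the value:

```python
import re

input_string = 'type="NN" span="123..145" confidence="1.0" '

# With two groups, findall returns a list of (key, value) tuples.
pairs = re.findall(r'(\w+)="([^"]*)"', input_string)
attributes = dict(pairs)

print(attributes)
# {'type': 'NN', 'span': '123..145', 'confidence': '1.0'}
```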
How can I define a function that takes a string (a sentence) and inserts an extra space after a period if the period is directly followed by a letter?
sent = "This is a test.Start testing!"
def normal(sent):
    list_of_words = sent.split()
    ...
This should print out
"This is a test. Start testing!"
I suppose I should use split() to break the string into a list, but what next?
P.S. The solution has to be as simple as possible.
Use re.sub. Your regular expression will match a period (\.) followed by a letter ([a-zA-Z]). Your replacement string will contain a reference to the first group (\1), which is the letter captured by the parentheses in the regular expression.
>>> import re
>>> re.sub(r'\.([a-zA-Z])', r'. \1', 'This is a test.This is a test. 4.5 balloons.')
'This is a test. This is a test. 4.5 balloons.'
Note the choice of [a-zA-Z] for the regular expression. This matches just letters. We do not use \w because it would insert spaces into a decimal number.
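Wrapped into the function the question asks for, this could look like the sketch below (same pattern as above; only the function wrapper is added):

```python
import re

def normal(sent):
    """Insert a space after any period that is directly followed by a letter."""
    return re.sub(r'\.([a-zA-Z])', r'. \1', sent)

print(normal("This is a test.Start testing!"))
# This is a test. Start testing!
```

Decimal numbers such as 4.5 and a period at the end of the string are left unchanged, because neither is followed by a letter.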
One-liner non-regex answer:
def normal(sent):
    # s[:1] instead of s[0] avoids an IndexError on empty pieces (e.g. a trailing period).
    return ".".join(" " + s if i > 0 and s[:1].isalpha() else s for i, s in enumerate(sent.split(".")))
Here is a multi-line version using a similar approach. You may find it more readable.
def normal(sent):
    sent = sent.split(".")
    result = sent[:1]
    for item in sent[1:]:
        # item[:1] is safe even when item is empty (e.g. after a trailing period).
        if item[:1].isalpha():
            item = " " + item
        result.append(item)
    return ".".join(result)
Using a regex is probably the better way, though.
Brute force without any checks:
>>> sent = "This is a test.Start testing!"
>>> k = sent.split('.')
>>> ". ".join(k)
'This is a test. Start testing!'
>>>
For removing spaces:
>>> sent = "This is a test. Start testing!"
>>> k = sent.split('.')
>>> l = [x.lstrip(' ') for x in k]
>>> ". ".join(l)
'This is a test. Start testing!'
>>>
Another regex-based solution, might be a tiny bit faster than Steven's (only one pattern match, and a blacklist instead of a whitelist):
import re
re.sub(r'\.([^\s])', r'. \1', some_string)
Improving pyfunc's answer:
sent="This is a test.Start testing!"
k=sent.split('.')
k='. '.join(k)
k.replace('.  ','. ')  # collapse the double space produced when a space already followed the period
'This is a test. Start testing!'