I have a sentence like this
s = " zero/NN divided/VBD by/IN anything/NN is zero/NN"
I need to replace each word/tag pair with just its tag. The output should be
s = "NN VBD IN NN is NN"
I tried using regex replace like this
tup = re.sub( r"\s*/$" , "", s)
but this does not give me the correct output. Please help.
This gives the output you want:
tup = re.sub( r"\b\w+/" , "", s)
\b matches a word boundary, \w+ then matches one or more word characters (a-zA-Z0-9_), and / matches the slash that follows them.
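To see why the attempt in the question leaves the string unchanged while this pattern works, here is a minimal sketch using only the sample string from the question. The final .strip() is my own addition, to drop the leading space that the sample string carries.

```python
import re

s = " zero/NN divided/VBD by/IN anything/NN is zero/NN"

# The original attempt only matches optional whitespace plus a slash at the
# very end of the string, which never occurs here, so nothing is replaced.
unchanged = re.sub(r"\s*/$", "", s)

# \b\w+/ removes each word together with the slash that follows it.
tags = re.sub(r"\b\w+/", "", s)

print(repr(unchanged))
print(repr(tags))
print(tags.strip())
```

Note that "is" survives because it has no slash after it, which matches the output asked for in the question.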
try:
tup = re.sub( r"[a-z]*/" , "", s)
In [1]: s = " zero/NN divided/VBD by/IN anything/NN is zero/NN"
In [2]: tup = re.sub( r"[a-z]*/" , "", s)
In [3]: print(tup)
NN VBD IN NN is NN
The \s character class matches whitespace characters, which doesn't seem to be what you want. I think you want the opposite, \S, which matches non-whitespace characters. You can also be more specific about what a tag is, for example:
tup = re.sub( r"\S+/([A-Z]+)" , r"\1", s)
This replaces all non-whitespace characters, followed by a slash and then a sequence of uppercase letters with just the uppercase letters.
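The patterns suggested so far agree on the sample sentence but can differ on other tokens. A small sketch, assuming a made-up input with a hyphenated word and an abbreviation (the sample string is my own, not from the question):

```python
import re

# Hypothetical input: a hyphenated word and an abbreviation with dots.
sample = "Well-known/JJ U.S./NNP is/VBZ"

lowercase_only = re.sub(r"[a-z]*/", "", sample)
word_chars = re.sub(r"\b\w+/", "", sample)
non_whitespace = re.sub(r"\S+/([A-Z]+)", r"\1", sample)

print(lowercase_only)   # Well-JJ U.S.NNP VBZ
print(word_chars)       # Well-JJ U.S./NNP VBZ
print(non_whitespace)   # JJ NNP VBZ
```

Only the \S+ version treats everything up to the slash as the word, so it is probably the safer choice if tokens can contain punctuation.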
tup = re.sub( r"\b\w+/(\w+)\b", r"\1", s)
On either side of my regex is \b, meaning "word boundary"; on either side of the "/" is \w+, meaning "one or more word characters". The \w+ on the right is put in parentheses to make it a group.
The second argument, r"\1", means "the first group", i.e. whatever the parentheses matched.
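If it helps to see what the group holds, this sketch inspects a single match and then does the same replacement with a function instead of r"\1". The lowercasing is only there to show that the replacement can be computed; it is not part of the question.

```python
import re

pattern = re.compile(r"\b\w+/(\w+)\b")

match = pattern.search("divided/VBD by/IN")
whole = match.group(0)   # everything the pattern matched
tag = match.group(1)     # only the part inside the parentheses

# A function can be passed instead of r"\1"; it receives each match object.
lowered = pattern.sub(lambda m: m.group(1).lower(), "divided/VBD by/IN")

print(whole, tag, lowered)
```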
In this problem I'm trying to find the symbols and spaces that sit between two alphanumeric characters. I am using regular expressions, but I cannot get the result I want. Any tips for this code are appreciated (regex solutions only):
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\w(.[#_!#$%^&*()<>?/\|}{~:\s]*)\w' # needs to be fixed
re.findall(regex_pattern, s)
Output is:
['h', '$#', '% ', 't']
Expected output is:
['$#', '% ']
You can try this simple pattern:
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'(?<=\w)[^\w]+?(?=\w)'
print(re.findall(regex_pattern, s))
Output:
['$#', '% ']
Basically, the pattern (?<=\w)[^\w]+?(?=\w) finds runs of one or more non-word characters that sit between two word characters. The lookbehind and lookahead only check the neighbouring characters; they do not become part of the match.
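To check that the neighbouring word characters really are left out of the match, you can print the span of each match. A small sketch with the string from the question:

```python
import re

s = "This$#is% Matrix# %!"
pattern = re.compile(r"(?<=\w)[^\w]+?(?=\w)")

# Collect (start, end) positions and the matched text for each hit.
hits = [(m.span(), m.group()) for m in pattern.finditer(s)]
for span, text in hits:
    print(span, repr(text))
```

The trailing "# %!" is never reported because nothing after it is a word character.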
Using a regex find all approach:
s = "This$#is% Matrix# %!"
matches = re.findall(r'(?<=\w)[#_!#$%^&*()<>?/\|}{~:\s]+(?=\w)', s)
print(matches) # ['$#', '% ']
This approach is similar to yours, except that it simply searches for a sequence of symbols or whitespace characters which are surrounded on both sides by word characters.
Your regex has a leading . that matches any character, and the * quantifier (0 or more) lets the character class match nothing at all, so any three word characters in a row are enough for a match. Drop the . and use + to require one or more non-alpha chars:
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\w([#_!#$%^&*()<>?/\|}{~:\s]+)\w'
print(re.findall(regex_pattern, s))
Output:
['$#', '% ']
My 'trick' is to use a tool such as regex101.com to make sure the regex works before going to code, and to build the regex up one step at a time. That way, when the regex stops matching, you know the most recent step caused the problem.
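One caveat with the capturing-group pattern above: the \w on each side is consumed, so a word character cannot end one match and start the next. A minimal sketch, using \W+ in place of the explicit character class to keep it short, and an input of my own:

```python
import re

s = "a#b#c"

# The 'b' is consumed by the first match, so the second '#' is missed.
consuming = re.findall(r"\w(\W+)\w", s)

# Lookarounds only inspect the neighbours, so both '#' are found.
lookaround = re.findall(r"(?<=\w)\W+(?=\w)", s)

print(consuming)
print(lookaround)
```

For the string in the question both give the same result, since every symbol run there is separated by at least two word characters.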
Your shortest solution is
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\b\W+\b'
print( re.findall(regex_pattern, s) ) # => ['$#', '% ']
Why it works
\b - the word boundary followed with \W pattern matches a location that is right after a word char (i.e. a letter, digit or _)
\W+ - matches one or more non-word chars, the chars other than letters, digits and underscores
\b - right after \W, the word boundary matches a location that is immediately followed with a word char.
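One difference worth knowing, shown in the sketch below: because _ counts as a word character, \b\W+\b never reports an underscore, whereas a character class that lists _ does. The input string and the shortened class [_$] are my own, for illustration only.

```python
import re

s = "a_b$c"

# '_' is a word character, so there is no boundary around it.
with_boundaries = re.findall(r"\b\W+\b", s)

# An explicit class that lists '_' treats it as a symbol.
with_class = re.findall(r"(?<=\w)[_$]+(?=\w)", s)

print(with_boundaries)
print(with_class)
```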
I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
import re

with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()
        # replace every run of characters that are not a-z, space or ' with a space
        line = re.sub('[^a-z\ \']+', " ", line)
        print(line)
If there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print(re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line)))
# Here is some stuff. Now there are quotes. Now there's not.
It is probably more efficient to do it @TigerhawkT3's way, though the two produce different results if you have something like 'this'. with punctuation right after the closing quote. If you want to remove that second ' too, then the regular expression method is probably the simplest you can do.
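A small sketch of that difference, on a sentence of my own where the closing quote is followed by a period:

```python
import re

line = "He said 'this'. It's fine."

# split/strip: the closing quote is not at the end of the word, so it stays.
stripped = " ".join(w.strip("'") for w in line.split())

# nested re.sub: the quote before the period is removed as well.
nested = re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line))

print(stripped)
print(nested)
```

One more thing to keep in mind: both substitutions need a character next to the quote, so a quote that is the very first or very last character of the string is left alone by the nested version.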
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word character
# | or
# \'(?!\w) match any quote not followed by a word character
s = "'here is some stuff 'now there are quotes' now there's not'"
print(rgx.sub('', s)) # here is some stuff now there are quotes now there's not
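To illustrate the whitespace point, here is a short comparison with the split/strip/join approach on an input of my own that contains a run of spaces and a tab:

```python
import re

rgx = re.compile(r"(?<!\w)'|'(?!\w)")
s = "'spaced   out'\t'tabs' and it's fine"

# The regex only deletes quotes, so spacing is untouched.
kept_spacing = rgx.sub("", s)

# split() + join() rebuilds the string with single spaces.
rejoined = " ".join(w.strip("'") for w in s.split())

print(repr(kept_spacing))
print(repr(rejoined))
```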
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
Here, ' is removed unless it has a word character (a letter, digit or _) on both sides.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
Here, ' is removed unless it has a letter on both sides (an ASCII letter in the first pattern, any Unicode letter in the second).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is a complete solution to remove whatever you don't want in a string:
def istext(text):
    # True if the word contains at least one alphanumeric character.
    return any(ch.isalnum() for ch in text)

def stripit(text, ofwhat):
    # Strip each unwanted character in turn from both ends of the word.
    for x in ofwhat:
        text = text.strip(x)
    return text

def purge(text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
    lines = text.splitlines()
    lines = [" ".join(stripit(word, notwanted) for word in line.split() if istext(word)) for line in lines]
    return "\n".join(lines)
>>> print(purge("'Nice, .to, see! you. Isn't it?'"))
Nice to see you Isn't it
Note: this also collapses every run of whitespace within a line to a single space, and drops words that contain no alphanumeric character at all.
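One thing to be aware of in stripit above: stripping one character at a time depends on the order of the characters in notwanted. As far as I can tell, passing the whole set to str.strip() in one call avoids that, since strip() removes any combination of the given characters. A minimal sketch with a reduced character set (the helper name and input are made up):

```python
def strip_one_at_a_time(text, unwanted):
    # Same loop as stripit: one pass per character, in the given order.
    for ch in unwanted:
        text = text.strip(ch)
    return text

unwanted = "'?"
word = "it'?"

looped = strip_one_at_a_time(word, unwanted)  # the ' is hidden behind ? on its pass
at_once = word.strip(unwanted)                # strips any mix of ' and ?

print(looped)
print(at_once)
```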
I have text in this format,
s = '[aaa]foo[bbb]bar[ccc]foobar'
Actually the text is a Chinese car review like this
【最满意】整车都很满意,最满意就是性价比,...【空间】空间真的超乎想象,毫不夸张,...【内饰】内饰还可以吧,没有多少可以说的...
Now I want to split it to these parts
[aaa]foo
[bbb]bar
[ccc]foobar
First I tried
>>> re.findall(r'\[.*?\].*?',s)
['[aaa]', '[bbb]', '[ccc]']
but that only got the first half.
Then I tried
>>> re.findall(r'(\[.*?\].*?)\[?',s)
['[aaa]', '[bbb]', '[ccc]']
which still only got the first half.
In the end I had to get the two parts separately and then zip them
>>> re.findall(r'\[.*?\]',s)
['[aaa]', '[bbb]', '[ccc]']
>>> re.split(r'\[.*?\]',s)
['', 'foo', 'bar', 'foobar']
>>> for t in zip(re.findall(r'\[.*?\]',s),[e for e in re.split(r'\[.*?\]',s) if e]):
...     print(''.join(t))
...
[aaa]foo
[bbb]bar
[ccc]foobar
So I want to know whether there is a regex that could split it directly into these parts.
One of the approaches:
import re
s = '[aaa]foo[bbb]bar[ccc]foobar'
result = re.findall(r'\[[^]]+\][^\[\]]+', s)
print(result)
The output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
\[ or \] - matches the bracket literally
[^]]+ - matches one or more characters except ]
[^\[\]]+ - matches any character(s) except brackets \[\]
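The same idea should carry over to the real review text if the ASCII brackets are swapped for the full-width 【 and 】. A sketch under that assumption, using a shortened version of the sample from the question:

```python
import re

# Shortened from the question's sample; the "..." parts are left out.
review = "【最满意】整车都很满意,最满意就是性价比【空间】空间真的超乎想象,毫不夸张【内饰】内饰还可以吧"

# 【...】 followed by everything up to the next full-width bracket.
parts = re.findall(r"【[^】]+】[^【】]+", review)

for part in parts:
    print(part)
```

Because the second class only excludes brackets, commas and other punctuation inside a section stay with that section.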
I think this could work:
r'\[.+?\]\w+'
Here it is:
>>> re.findall(r"(\[\w*\]\w+)",s)
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Explanation:
The parentheses mark the group to search for. Which group?
It should start with an opening bracket \[ followed by some word characters \w,
then the matching closing bracket \] followed by more word characters \w.
Notice that you have to escape the brackets with \.
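A caveat with the \w-based patterns: \w+ stops at the first space or punctuation mark, so the text after a tag can get cut short. A quick sketch on an input of my own with a space in the first section, compared with a negated character class:

```python
import re

s = "[aaa]foo bar[bbb]baz"

# \w+ stops at the space, so " bar" is lost.
word_based = re.findall(r"\[\w*\]\w+", s)

# A negated class keeps everything up to the next bracket.
bracket_based = re.findall(r"\[[^]]+\][^\[\]]+", s)

print(word_based)
print(bracket_based)
```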
I think if the input string format is "strict enough", it's possible to try something without regexes. It may look like a micro-optimisation, but it could be interesting as a challenge.
result = map(lambda x: '[' + x, s[1:].split("["))
So I checked performance over 1 million iterations, and here are my results (in seconds):
result = map(lambda x: '[' + x, s[1:].split("[")) # 0.89862203598
result = re.findall(r'\[[^]]+\][^\[\]]+', s) # 1.48306798935
result = re.findall(r'\[.+?\]\w+', s) # 1.47224497795
result = re.findall(r'(\[\w*\]\w+)', s) # 1.47370815277
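If you want to repeat the measurement, a rough sketch with timeit follows. Note that on Python 3 map() returns a lazy iterator, so timing the map line on its own would measure almost nothing; the sketch uses a list comprehension so the work is actually done. The function names are mine and the timings will differ per machine, so no numbers are shown.

```python
import re
import timeit

s = '[aaa]foo[bbb]bar[ccc]foobar'

def by_split():
    # Drop the first '[', split on the rest, then put the '[' back.
    return ['[' + part for part in s[1:].split('[')]

def by_findall():
    return re.findall(r'\[[^]]+\][^\[\]]+', s)

# Both approaches should agree before comparing their speed.
assert by_split() == by_findall()

for fn in (by_split, by_findall):
    print(fn.__name__, timeit.timeit(fn, number=100_000))
```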
\[.*?\][a-zA-Z]*
This regex should capture anything that starts with [something here] followed by any letters from a to z, upper or lower case.
You can experiment on regex101 to try out different patterns; it's easy to build your own regex there.
All you need is findall with a very simple pattern:
import re
print(re.findall(r'\[\w+\]\w+','[aaa]foo[bbb]bar[ccc]foobar'))
output:
['[aaa]foo', '[bbb]bar', '[ccc]foobar']
Detailed solution:
import re
string_1='[aaa]foo[bbb]bar[ccc]foobar'
pattern=r'\[\w+\]\w+'
print(re.findall(pattern,string_1))
explanation:
\[\w+\]\w+
\[ matches the character [ literally
\w+ matches one or more word characters (\w is equal to [a-zA-Z0-9_] for ASCII text)
\] matches the character ] literally
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed
Let's say I have a string like this:
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
and I want to turn it into
'(xy09 and foobar or (abc123 and something))'
then - in this particular case - I could simply do
s.replace('X_', "")
which gives the desired output.
However, in my actual data there might be not only X_ but also other prefixes, so the above replace statement does not work.
What I would need instead is a replacement of
a capital letter followed by an underscore and an arbitrary sequence of letters and numbers
by
everything after the first underscore.
So, to extract the desired elements I could use:
import re
print(re.findall('[A-Z]{1}_[a-zA-Z0-9]+', s))
which prints
['X_xy09', 'X_foobar', 'X_abc123', 'X_something']
How can I now replace those elements so that I obtain
'(xy09 and foobar or (abc123 and something))'?
If you need to remove an uppercase ASCII letter with an underscore after it, only when not preceded with a word char and when followed with an alphanumeric char, you may use
import re
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
print(re.sub(r'\b[A-Z]_([a-zA-Z0-9])', r'\1', s))
Pattern details
\b - a leading word boundary
[A-Z]_ - an ASCII uppercase letter and _
([a-zA-Z0-9]) - Group 1 (later referenced to with \1 from the replacement pattern): 1 alphanumeric char.
If you just need to replace a capital letter followed by an underscore, you can use the regular expression r'[A-Z]_'.
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
re.sub(r'[A-Z]_', '', s)
You may need to add to it if you have other criteria not mentioned. (For example, some of your target values follow a word boundary and some follow parentheses.) The above might give you the wrong output if you have input like XY_something. It depends on what you expect the output to be.
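To make the XY_something case concrete, here is a short sketch comparing the bare pattern with a version anchored by a word boundary (the input is made up):

```python
import re

s = "(X_xy09 and XY_something)"

# Without a boundary, the "Y_" inside "XY_" is removed too.
bare = re.sub(r"[A-Z]_", "", s)

# With \b, only a single capital letter at the start of a word is removed.
anchored = re.sub(r"\b[A-Z]_([a-zA-Z0-9])", r"\1", s)

print(bare)
print(anchored)
```

Which of the two is right depends on what you expect for multi-letter prefixes.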
Another re.sub() approach:
import re
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
result = re.sub(r'[A-Z]_(?=[a-zA-Z0-9]+)', '', s)
print(result)
The output:
(xy09 and foobar or (abc123 and something))
[A-Z]_(?=[a-zA-Z0-9]+) - (?=...) positive lookahead assertion, ensures that substituted [A-Z]_ substring is followed by alphanumeric sequence [a-zA-Z0-9]+
You could use re.sub() with a lookahead assertion:
>>> import re
>>> s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
>>> re.sub(r'\b[A-Z]_(?=[a-zA-Z0-9])', '', s)
'(xy09 and foobar or (abc123 and something))'
from the docs:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
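A quick sketch of what "doesn't consume" means in practice, first with the example from the docs and then with the prefix pattern (the last input string is my own):

```python
import re

# The lookahead is checked but not included in the match.
found = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
not_found = re.search(r"Isaac (?=Asimov)", "Isaac Newton")

# Only prefixes that are followed by an alphanumeric character are removed.
cleaned = re.sub(r"\b[A-Z]_(?=[a-zA-Z0-9])", "", "X_abc and Y_ only")

print(repr(found.group()))
print(not_found)
print(cleaned)
```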
So, I am trying to find a word (a complete word) in a sentence. Let's say the sentence is
Str1 = "1. how are you doing"
and that I am interested in finding if
Str2 = "1."
is in it. If I do,
re.search(r"%s\b" % Str2, Str1, re.IGNORECASE)
it should say that a match was found, shouldn't it? But re.search fails for this query. Why?
There are two things wrong here:
\b matches a position between a word and a non-word character, so between any letter, digit or underscore, and a character that doesn't match that set.
You are trying to match the boundary between a . and a space; both are non-word characters and the \b anchor would never match there.
You are handing re a 1., which means 'match a 1 followed by any character'. You'd need to escape the dot, for example with re.escape(), to match a literal dot.
The following works better:
re.search(r"%s(?:\s|$)" % re.escape(Str2), Str1, re.IGNORECASE)
Now it'll match your input literally, and look for a following space or the end of the string. The (?:...) creates a non-capturing group (always a good idea unless you specifically need to capture sections of the match); inside the group the | pipe gives two alternatives: either match \s (whitespace) or match $ (the end of the string, or the end of a line if re.MULTILINE is set). You can expand this as needed.
Demo:
>>> import re
>>> Str1 = "1. how are you doing"
>>> Str2 = "1."
>>> re.search(r"%s(?:\s|$)" % re.escape(Str2), Str1, re.IGNORECASE)
<_sre.SRE_Match object at 0x10457eed0>
>>> _.group(0)
'1. '
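The two problems can be checked in isolation, and if the match also has to start at the beginning of a token, one option is to use whitespace lookarounds on both sides. This is only a sketch: contains_token is a hypothetical helper, and (?<!\S) / (?!\S) are my substitution for the (?:\s|$) group, not the pattern from the answer above.

```python
import re

text = "1. how are you doing"

# An unescaped dot matches any character, so "1." also matches "1x".
unescaped_hit = re.search(r"1.", "1x how") is not None

# \b needs a word character on one side; "." followed by " " has none.
boundary_hit = re.search(re.escape("1.") + r"\b", text) is not None

def contains_token(needle, haystack):
    # Hypothetical helper: the needle must be delimited by whitespace
    # or by the start/end of the string on both sides.
    pattern = r"(?<!\S)" + re.escape(needle) + r"(?!\S)"
    return re.search(pattern, haystack, re.IGNORECASE) is not None

print(unescaped_hit, boundary_hit)
print(contains_token("1.", text))
print(contains_token("1.", "11. how are you doing"))
```

The last call returns False, whereas a pattern that only checks what comes after the needle would also find "1. " inside "11. ".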