Regex to Select All Math Operator not Hyphen - python

I am trying to implement a regex that splits the string on all math operators, but not on hyphens that are part of the text:
dummy_string= "I Dont_Know The-Meaning_2018-You Know_Meaning_2017+You Know_Meaning_2017"
string_list = re.split("[0-9][+/*\-][A-Za-z]", dummy_string)
print(string_list)
>>['I Dont_Know The-Meaning_201', 'ou Know_Meaning_201', 'ou Know_Meaning_2017']
Expected Output:
>>['I Dont_Know The-Meaning_2018', 'You Know_Meaning_2017', 'You Know_Meaning_2017']
I am using regex (re) package for this.

You may use (?<=[0-9]) and (?=[A-Za-z]) lookarounds instead of consuming patterns:
import re
dummy_string= "I Dont_Know The-Meaning_2018-You Know_Meaning_2017+You Know_Meaning_2017"
string_list = re.split("(?<=[0-9])[+/*-](?=[A-Za-z])", dummy_string)
print(string_list)
# => ['I Dont_Know The-Meaning_2018', 'You Know_Meaning_2017', 'You Know_Meaning_2017']
When you use [0-9][+/*\-][A-Za-z] to split a string, the digit before a non-word delimiter and a letter after it are consumed, i.e. added to the match value, and re.split removes this text from the resulting output. When using lookarounds, the matched texts remain "unconsumed", they are not added to the match value and thus remain in the re.split output.
Note that you do not have to escape - when it is at the end of the character class, [+/*-] = [+/*\-]. If you plan to add more chars into the class, you may keep - escaped to avoid further issues.
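The difference between the two patterns can be checked side by side. This is only a minimal sketch; the variable names consuming and lookaround are illustrative and not part of the original answer.

```python
import re

dummy_string = "I Dont_Know The-Meaning_2018-You Know_Meaning_2017+You Know_Meaning_2017"

# Consuming pattern: the digit and the letter belong to the match,
# so re.split drops them together with the operator.
consuming = re.split(r"[0-9][+/*\-][A-Za-z]", dummy_string)

# Lookaround pattern: only the operator itself is matched and removed.
lookaround = re.split(r"(?<=[0-9])[+/*-](?=[A-Za-z])", dummy_string)

print(consuming)
print(lookaround)
```

The first list loses one character on each side of every split point, while the second keeps them.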

Related

Match a piece of text from the beginning up to the first occurrence of multicharacter substring

I want a regex search to end when it reaches ". ", but not when it reaches "."; I'm aware of using [^...] to exclude single characters, and have been using this to stop my search when it reaches a certain character. This does not work with strings though, as [^. ] stops when it reaches either character. Say I've got the code
import re
def main():
    my_string = "The value of the float is 2.5. The int's value is 2.\n"
    re.search("[^.]*", my_string)
main()
Which gives a match object with the string
"The value of the float is 2"
How can I change this so that it only stops after the string ". "?
Bonus question, is there any way to tell regex to stop whenever it reaches one of multiple strings? Using the above code as an example, if I wanted the search to end when it found the string ". " or the string ".\n", how would I go about it? Thanks!
To match from the start of a string up to the first . followed by whitespace, use
^(.*?)\.\s
If you want to require only a space or a newline after the dot, use either of the following (the second is best when all alternatives are single characters; use alternation when some alternatives are multicharacter):
^(.*?)\.(?: |\n)
^(.*?)\.[ \n]
Details
^ - start of a string
(.*?) - Capturing group 1: any 0+ chars other than linebreak chars, as few as possible
\. - a literal . char
\s - a whitespace char
(?: |\n) / [ \n] - a non-capturing group matching either a space or (|) a newline.
Python demo:
import re
my_string = "The value of the float is 2.5. The int's value is 2.\n"
m = re.search(r"^(.*?)\.\s", my_string)  # Try to find a match
if m:  # If there is a match
    print(m.group(1))  # Show Group 1 value
NOTE If there can be line breaks in the input, pass re.S or re.DOTALL flag:
m = re.search(r"^(.*?)\.\s", my_string, re.DOTALL)
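To see what the flag changes, here is a small sketch with a made-up input (the text variable is hypothetical, not from the question) whose first sentence spans two lines:

```python
import re

# Hypothetical input: the first sentence contains a line break.
text = "The first sentence starts here\nand ends here. Second sentence."

# Without re.DOTALL, "." cannot cross the newline, and "^" only matches
# at the very start, so the search finds nothing.
no_flag = re.search(r"^(.*?)\.\s", text)

# With re.DOTALL, "." also matches "\n", so the search reaches the first ". ".
with_flag = re.search(r"^(.*?)\.\s", text, re.DOTALL)

print(no_flag)
print(repr(with_flag.group(1)))
```

The first search returns None; the second captures both lines of the first sentence.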
Besides the classic approach explained by Wiktor, splitting may also be an interesting solution in this case.
>>> my_string
"The value of the float is 2.5. The int's value is 2.\n"
>>> re.split(r'\. |\.\n', my_string)
['The value of the float is 2.5', "The int's value is 2", '']
If you want to include periods at the end of the sentence, you can do something like this:
['{}.'.format(sentence) for sentence in re.split(r'\. |\.\n', my_string) if sentence]
To handle multiple whitespace characters between the sentences:
>>> str2 = "The value of the float is 2.5. The int's value is 2.\n\n "
>>> ['{}.'.format(sentence)
...  for sentence in re.split(r'\. \s*|\.\n\s*', str2)
...  if sentence]
['The value of the float is 2.5.', "The int's value is 2."]
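A different technique, not the one used in the answer above, is to split on whitespace that follows a period by using a lookbehind, so the periods never have to be added back. A minimal sketch, assuming every sentence ends with a period followed by whitespace:

```python
import re

my_string = "The value of the float is 2.5. The int's value is 2.\n"

# (?<=\.) checks that a period precedes the whitespace without consuming it,
# so the period stays attached to its sentence. Empty strings are filtered out.
sentences = [s for s in re.split(r"(?<=\.)\s+", my_string) if s]

print(sentences)
```

The period inside 2.5 is left alone because it is followed by a digit rather than whitespace.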

Replace strings in a string by a substring of those strings

Let's say I have a string like this:
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
and I want to turn it into
'(xy09 and foobar or (abc123 and something))'
then - in this particular case - I could simply do
s.replace('X_', "")
which gives the desired output.
However, in my actual data there might be not only X_ but also other prefixes, so the above replace statement does not work.
What I would need instead is a replacement of
a capital letter followed by an underscore and an arbitrary sequence of letters and numbers
by
everything after the first underscore.
So, to extract the desired elements I could use:
import re
print(re.findall('[A-Z]{1}_[a-zA-Z0-9]+', s))
which prints
['X_xy09', 'X_foobar', 'X_abc123', 'X_something']
How can I now replace those elements so that I obtain
'(xy09 and foobar or (abc123 and something))'
?
If you need to remove an uppercase ASCII letter with an underscore after it, only when not preceded by a word char and when followed by an alphanumeric char, you may use
import re
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
print(re.sub(r'\b[A-Z]_([a-zA-Z0-9])', r'\1', s))
Pattern details
\b - a leading word boundary
[A-Z]_ - an ASCII uppercase letter and _
([a-zA-Z0-9]) - Group 1 (later referenced with \1 from the replacement pattern): 1 alphanumeric char.
If you just need to replace a capital letter followed by an underscore, you can use the regular expression r'[A-Z]_'.
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
re.sub(r'[A-Z]_', '', s)
You may need to add to it if you have other criteria not mentioned. (For example, some of your target values follow a word boundary and some follow parentheses.) The above might give you the wrong output if you have input like XY_something. It depends on what you expect the output to be.
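The XY_something caveat can be illustrated with a short sketch. The input string below is hypothetical (several single-letter prefixes plus one two-letter prefix), and the variable names are only for illustration:

```python
import re

# Hypothetical input: single-letter prefixes X_, Y_, Z_ and a two-letter prefix AB_.
s = '(X_xy09 and Y_foobar or (Z_abc123 and AB_something))'

# Without a word boundary, the "B_" inside "AB_something" is removed as well.
without_boundary = re.sub(r'[A-Z]_', '', s)

# With \b, only a capital letter that starts a word is removed.
with_boundary = re.sub(r'\b[A-Z]_([a-zA-Z0-9])', r'\1', s)

print(without_boundary)
print(with_boundary)
```

Which of the two results is right depends on how multi-letter prefixes should be treated in the real data.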
Another re.sub() approach:
import re
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
result = re.sub(r'[A-Z]_(?=[a-zA-Z0-9]+)', '', s)
print(result)
The output:
(xy09 and foobar or (abc123 and something))
[A-Z]_(?=[a-zA-Z0-9]+) - (?=...) is a positive lookahead assertion; it ensures that the substituted [A-Z]_ substring is followed by an alphanumeric sequence [a-zA-Z0-9]+.
You could use re.sub() with a lookahead assertion:
>>> import re
>>> s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
>>> re.sub(r'\b[A-Z]_(?=[a-zA-Z0-9])', '', s)
'(xy09 and foobar or (abc123 and something))'
From the docs:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
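The example from the documentation can be tried directly. A minimal sketch; the names hit and miss are illustrative:

```python
import re

pattern = re.compile(r'Isaac (?=Asimov)')

# The lookahead succeeds, but 'Asimov' is not part of the matched text.
hit = pattern.search('Isaac Asimov')

# The lookahead fails, so there is no match at all.
miss = pattern.search('Isaac Newton')

print(repr(hit.group()))
print(miss)
```

The match covers only 'Isaac ' with its trailing space, which is why the same idea leaves the text after the prefix untouched in re.sub.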

How to remove substrings marked with special characters from a string?

I have a string in Python:
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a nummber."
print(Tt)
'This is a <"string">string, It should be <"changed">changed to <"a">a nummber.'
You can see that some words are repeated inside the <\" \"> parts.
My question is: how do I delete those repeated parts (delimited by the characters shown)?
The result should be like:
'This is a string, It should be changed to a nummber.'
Use regular expressions:
import re
Tt = re.sub('<\".*?\">', '', Tt)
Note the ? after *. It makes the expression non-greedy,
so it matches as few characters as possible between <\" and \">.
James's solution works only when each delimiter consists of a single character (< and >). In that case it is possible to use a negated class like [^>]. If you want to remove a substring delimited by character sequences (e.g. by begin and end), you should use a non-greedy regular expression (i.e. .*?).
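A short sketch of the greedy versus non-greedy behaviour on the string from the question. The wrapped string with begin and end markers is a made-up example, and the variable names are illustrative:

```python
import re

Tt = 'This is a <"string">string, It should be <"changed">changed to <"a">a nummber.'

# Greedy: .* runs from the first <" to the last ">, removing everything between.
greedy = re.sub(r'<".*">', '', Tt)

# Non-greedy: .*? stops at the nearest ">, so each marked part is removed separately.
lazy = re.sub(r'<".*?">', '', Tt)

# The same idea with multi-character delimiters (hypothetical input).
wrapped = 'keep begin drop end keep'
stripped = re.sub(r'begin.*?end', '', wrapped)

print(greedy)
print(lazy)
print(stripped)
```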
I'd use a quick regular expression:
import re
Tt = "This is a <\"string\">string, It should be <\"changed\">changed to <\"a\">a number."
print(re.sub("<[^<]+>", "", Tt))
#Out: This is a string, It should be changed to a number.
Ah, this is similar to Igor's post; he beat me by a bit. Rather than making the expression non-greedy, I don't match an expression if it contains another start tag "<", so it only matches a start tag that's followed by an end tag ">".

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in them, but not at the beginning and not at the end of the word.
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
This means: match one or more word characters (letters, digits, or underscore), as indicated by the '+', then a '-', then one or more word characters again.
str is a built-in name, so it is better not to use it as a variable name.
import re
st = 'word semi-column peace'
# \w+ word - \w+ word after -
print(re.findall(r"\b\w+-\w+\b",st))
['semi-column']
a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to prevent matches in words with hyphens at the beginning or end. A string like "-not-wanted-" will match both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
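For comparison, the sketch below runs the simpler \b-based pattern and the stricter lookaround pattern on the same test string. The variable names simple and strict are illustrative:

```python
import re

text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"

# Word boundaries alone: hyphens at the edges are ignored, and a word with
# two hyphens is cut off after the second part.
simple = re.findall(r'\b\w+-\w+\b', text)

# Lookarounds plus a repeated "-\w+" group: edge hyphens reject the word,
# and any number of inner hyphens is allowed.
strict = re.findall(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])', text)

print(simple)
print(strict)
```

The \b version returns 'not-wanted', 'dont-match' and the truncated 'multi-hyphenated', while the lookaround version returns only 'semi-column', 'one-word' and 'multi-hyphenated-word'.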
You can also use the following regex:
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+\-\S+", st))
['semi-column']
You can try something like this. Centering on the hyphen, I match in both directions until I reach whitespace or another hyphen, so for a word surrounded by hyphens (e.g. -test-cats-) the outer hyphens are not included in the match. The regular expression should also work with findall.
import re
st = 'word semi-column peace'
# Inside [...] the characters | and ^ (when not first) are literals, so this class
# excludes spaces, pipes, carets and hyphens; [^ -]+ would be enough here.
m = re.search(r'([^ | ^-]+-[^ | ^-]+)', st)
if m:
    print(m.group(1))

Match single quotes from python re

How do I match the following? I want all the names within the single quotes:
This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'
How do I extract only the names within the single quotes?
name = re.compile(r'^\'+\w+\'')
The following regex finds all single words enclosed in quotes:
In [6]: re.findall(r"'(\w+)'", s)
Out[6]: ['Tom', 'Harry', 'rock']
Here:
the ' matches a single quote;
the \w+ matches one or more word characters;
the ' matches a single quote;
the parentheses form a capture group: they define the part of the match that gets returned by findall().
If you only wish to find words that start with a capital letter, the regex can be modified like so:
In [7]: re.findall(r"'([A-Z]\w*)'", s)
Out[7]: ['Tom', 'Harry']
I'd suggest
r = re.compile(r"\B'\w+'\B")
apos = r.findall("This hasn't been much that much of a twist and turn's to 'Tom','Harry' and u know who..yes its 'rock'")
Result:
>>> apos
["'Tom'", "'Harry'", "'rock'"]
The "negative word boundaries" (\B) prevent matches like the 'n' in words like Rock'n'Roll.
Explanation:
\B # make sure that we're not at a word boundary
' # match a quote
\w+ # match one or more alphanumeric characters
' # match a quote
\B # make sure that we're not at a word boundary
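The effect of \B can be seen on a made-up sentence that contains both an inner-word quote pair and a properly quoted name. This is only a sketch; the sample string is hypothetical:

```python
import re

# Hypothetical input: 'n' sits inside a word, 'Tom' is a separately quoted name.
sample = "Rock'n'Roll was played by 'Tom'"

# Without \B, the 'n' inside Rock'n'Roll is picked up too.
plain = re.findall(r"'\w+'", sample)

# With \B on both sides, a quote that touches a word character on its outer
# side is rejected, so only 'Tom' remains.
guarded = re.findall(r"\B'\w+'\B", sample)

print(plain)
print(guarded)
```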
^ ('hat' or 'caret', among other names) in regex means "start of the string" (or, given particular options, "start of a line"), which you don't care about. Omitting it makes your regex work fine:
>>> re.findall(r'\'+\w+\'', s)
["'Tom'", "'Harry'", "'rock'"]
The regexes others have suggested might be better for what you're trying to achieve; this is the minimal change to fix your problem.
Your regex can only match a pattern following the start of the string. Try something like: r"'([^']*)'"
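One thing to watch with [^']* on this particular input: the apostrophes in hasn't and turn's are single quotes too, so they can pair up with each other. The sketch below compares it with the \w+ version from the earlier answer; the variable names loose and strict are illustrative:

```python
import re

s = ("This hasn't been much that much of a twist and turn's to "
     "'Tom','Harry' and u know who..yes its 'rock'")

# [^']* accepts spaces, so the apostrophes in "hasn't" and "turn's"
# are treated as an opening and a closing quote.
loose = re.findall(r"'([^']*)'", s)

# \w+ only accepts word characters between the quotes.
strict = re.findall(r"'(\w+)'", s)

print(loose)
print(strict)
```

On this input the loose pattern returns an extra first item made of the text between the two apostrophes, so [^']* is probably better suited to text without contractions.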
