How to make this Regex pattern work for both strings - python

I have the strings 'amount $165' and 'amount on 04/20' (and a few other variations I have no issues with so far). I want to run an expression that returns the numerical amount if one is available (165 in the first string), returns nothing if it is not, and does not mistake a date for an amount (second string). If I write the code as follows, it returns 165 but also returns 04 from the second string.
amount_search = re.findall(r'amount.*?(\d+)[^\d?/]?', string)
If I write it as follows, it returns neither:
amount_search = re.findall(r'amount.*?(\d+)[^\d?/]', string)
How can I change what I have so that it returns 165 but not 04?

To capture the whole number in a group, you could match amount followed by matching all chars except digits or newlines if the value can not cross newline boundaries.
Capture the first encountered digits in a group and assert a whitespace boundary at the right.
\bamount [^\d\r\n]*(\d+)(?!\S)
In parts
\bamount Match amount followed by a space and preceded with a word boundary
[^\d\r\n]* Match 0 or more times any char except a digit or newlines
(\d+) Capture group 1, match 1 or more digits
(?!\S) Assert a whitespace boundary on the right
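A minimal sketch of how this pattern behaves on the two strings from the question, assuming Python's re module; the names amount_re, samples and results are only illustrative:

```python
import re

# Pattern from the answer above: the digits must be followed by
# whitespace or the end of the string, which rules out "04/20".
amount_re = re.compile(r'\bamount [^\d\r\n]*(\d+)(?!\S)')

samples = ['amount $165', 'amount on 04/20']
results = {s: amount_re.findall(s) for s in samples}

for s, found in results.items():
    print(repr(s), '->', found)
```

The date fails because the / right after 04 is a non-whitespace character, so the lookahead rejects it.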

Try this: ^amount\W*\$([\d]{1,})$
The final $ indicates the end of the line; from what I have tested, using .* or ? in its place also works.
Because the group matches only digits, the / in the date format cannot be part of the match.
Hope this helps :)

Try this:
from re import sub
# split the input string (named string, as in the question) on whitespace
your_digit_list = [int(sub(r'[^0-9]', '', s)) for s in string.split() if s.lstrip('$').isdigit()]

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I feel I am having the most difficulty explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", with the maximum number of repeats for any given character being two.
Matching the excess duplicates, and then re.sub -ing it seems to me the approach to take, but I can't figure out the regex statement I need.
The only attempt I feel is even worth posting is this - (\w)\1{3,0}
Which matched only the first instance of a character repeating more than three times - so only one match, and the whole block of repeated characters, not just the ones exceeding the max of 2. Any help is appreciated!
The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
string = "aaaaaaaaare yooooooooou okkkkkk"
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
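As a sketch, the same idea can be wrapped in a small helper with a configurable limit. The function name limit_repeats and its max_repeats parameter are hypothetical, not part of the original answer:

```python
import re

def limit_repeats(text, max_repeats=2):
    """Collapse runs of one word character to at most max_repeats copies."""
    # (\w) captures one character; \1{n,} requires at least n more copies,
    # so the whole run is longer than n characters.
    pattern = r'(\w)\1{%d,}' % max_repeats
    return re.sub(pattern, lambda m: m.group(1) * max_repeats, text)

print(limit_repeats("aaaaaaaaare yooooooooou okkkkkk"))
```

With the default limit this prints aare yoou okk, matching the output wanted in the question.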
You could write
import re
string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = r'(\w)\1*(?=\1\1)'
re.sub(rgx, '', string)
#=> "aare yoou okk"
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead
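The breakdown above can be kept in the code itself by compiling the pattern with re.VERBOSE. This is a sketch; the names excess_re and trimmed are illustrative:

```python
import re

excess_re = re.compile(r"""
    (\w)      # one word character, saved to capture group 1
    \1*       # zero or more further copies of that character
    (?=\1\1)  # lookahead: two more copies must follow (not consumed)
""", re.VERBOSE)

# Only the excess characters are matched and removed; the last two
# copies of each run are left in place by the lookahead.
trimmed = excess_re.sub('', "aaaaaaaaare yooooooooou okkkkkk")
print(trimmed)
```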

Why does it not give all positive numbers in the string? Regex in Python

I don't understand why it only gives 125, the first number. Why does it not give all the positive numbers in that string? My goal is to extract all positive numbers.
import re
pattern = re.compile(r"^[+]?\d+")
text = "125 -898 8969 4788 -2 158 -947 599"
matches = pattern.finditer(text)
for match in matches:
    print(match)
Try using the regular expression
-\d+|(\d+)
Disregard the matches. The strings representing non-negative integers are saved in capture group 1.
The idea is to match but not save to a capture group what you don't want (negative numbers), and both match and save to a capture group what you do want (non-negative numbers).
The regex attempts to match -\d+. If that succeeds the regex engine's internal string pointer is moved to just after the last digit matched. If -\d+ is not matched an attempt is made to match the second part of the alternation (following |). If \d+ is matched the match is saved to capture group 1.
Any plus signs in the string can be disregarded.
For a fuller description of this technique see The Greatest Regex Trick Ever. (Search for "Tarzan"|(Tarzan) to get to the punch line.)
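A short sketch of this technique in Python, using the text from the question. It uses re.finditer rather than re.findall, because findall would return an empty string for every match in which group 1 did not participate:

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# Negative numbers are consumed by the first alternative and leave
# group 1 unset; only the second alternative fills group 1.
non_negative = [m.group(1)
                for m in re.finditer(r'-\d+|(\d+)', text)
                if m.group(1) is not None]
print(non_negative)
```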
The following pattern will only match non negative numbers:
pattern = re.compile(r"(?:^|[^-\d])(\d+)")
pattern.findall(text)
OUTPUT
['125', '8969', '4788', '158', '599']
For the sake of completeness another idea by use of \b and a lookbehind.
\b(?<!-)\d+
See this demo at regex101
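For reference, a minimal sketch of running this pattern on the text from the question (the variable names are illustrative):

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# \b stops the match from starting in the middle of a number, and
# (?<!-) rejects a number that is directly preceded by a minus sign.
found = re.findall(r'\b(?<!-)\d+', text)
print(found)
```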
Your pattern ^[+]?\d+ is anchored at the start of the string, and will give only that match at the beginning.
Another option is to assert a whitespace boundary to the left, and match the optional + followed by 1 or more digits.
(?<!\S)\+?\d+\b
(?<!\S) Assert a whitespace boundary to the left
\+? Match an optional +
\d+\b Match 1 or more digits followed by a word boundary
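A sketch of this pattern on a made-up sample string (not from the question) that also contains an explicit plus sign and a digit glued to a letter:

```python
import re

sample = "125 -898 +42 x7 599"

# (?<!\S) means the match must start at the beginning of the string
# or right after whitespace, so "-898" and the 7 in "x7" are skipped.
signed = re.findall(r'(?<!\S)\+?\d+\b', sample)
print(signed)
```

Because there is no capture group, re.findall returns the whole match, so the + is included where it is present.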
Use , to separate the numbers in the string.

Capturing repeated pattern in Python

I'm trying to implement some kind of markdown like behavior for a Python log formatter.
Let's take this string as example:
**This is a warning**: Virus manager __failed__
A few regexes later the string has lost the markdown like syntax and been turned into bash code:
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
But that should be compressed to
\033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
I tried these, beside many other non working solutions:
(\\033\[([\d]+)m){2,} => Capture: \033[33m\033[1m with g1 '\033[1m' and g2 '1' and \033[0m\033[0mwith g1 '\033[0m' and g2 '0'
(\\033\[([\d]+)m)+ many results, not ok
(?:(\\033\[([\d]+)m)+) many results, although this is the recommended way for repeated patterns if I understood correctly, not ok
and others..
My goal is to have as results:
Input
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
Output
Match 1
\033[33m\033[1m
Group1: 33
Group2: 1
Match 2
\033[0m\033[0m
Group1: 0
Group2: 0
In other words, capture the ones that are "duplicated" and not the ones alone, so I can fuse them with a regex sub.
You want to match consecutively repeating \033[\d+m chunks of text and join the numbers after [ with a semicolon.
You may use
re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(set(re.findall(r"\[(\d+)", m.group())))+'m', text)
See the Python demo online
The (?:\\033\[\d+m){2,} pattern will match two or more sequences of \033[ + one or more digits + m chunks of texts and then, the match will be passed to the lambda expression, where the output will be: 1) \033[, 2) all the numbers after [ extracted with re.findall(r"\[(\d+)", m.group()) and deduplicated with the set, and then 3) m.
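One caveat: a Python set does not guarantee iteration order, so the joined codes may not come out in their original order. A sketch of an order-preserving variant, assuming (as in the question) that the text contains the literal characters \033 rather than a real escape byte; merge_sgr and join_codes are hypothetical names:

```python
import re

def merge_sgr(text):
    """Fuse runs of two or more \\033[Nm chunks into one \\033[N;M...m chunk."""
    def join_codes(m):
        codes = re.findall(r"\[(\d+)m", m.group())
        # dict.fromkeys removes duplicates while keeping first-seen order
        unique = list(dict.fromkeys(codes))
        return r"\033[" + ";".join(unique) + "m"
    return re.sub(r"(?:\\033\[\d+m){2,}", join_codes, text)

text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"
print(merge_sgr(text))
```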
The patterns in the string to be modified have not been made clear from the question. For example, is 033 fixed or might it be 025 or even 25? I've made certain assumptions in using the regex
^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)
to obtain two capture groups that are to be combined, separated by a semi-colon. I've attempted to make clear my assumptions below, in part to help the OP modify this regex to satisfy alternative requirements.
The regex performs the following operations:
^ # match beginning of line
( # begin cap grp 1
\\0 # match '\0'
(\d+) # match 1+ digits in cap grp 2
\[ # match '['
\2 # match contents of cap grp 2
) # end cap grp 1
[a-z] # match a lc letter
\\0 # match '\0'
\2 # match contents of cap grp 2
\[ # match '['
(\d[a-z].+) # match a digit, then lc letter then 1+ chars to the
# end of the line in cap grp 3
As you see, the portion of the string captured in group 1 is
\033[33
I've assumed that the part of this string that is now 033 must be two or more digits beginning with a zero, and the second appearance of a string of digits consists of the same digits after the zero. This is done by capturing the digits following '0' (33) in capture group 2 and then using a back-reference \2.
The next part of the string is to be replaced and therefore is not captured:
m\\033[
I've assumed that m must be one lower case letter (or should it be a literal m?), the backslash and zero are required and the following digits must again match the content of capture group 2.
The remainder of the string,
1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
is captured in capture group 3. Here I've assumed it begins with one digit (perhaps it should be \d+) followed by one lower case letter that needn't be the same as the lower case letter matched earlier (though that could be enforced with another capture group). At that point I match the remainder of the line with .+, having given up matching patterns in that part of the string.
One may alternatively have just two capture groups, the capture group that is now #2, becoming #1, and #2 being the part of the string that is to be replaced with a semicolon.
This is pretty straightforward for the cases you describe here; simply write out from left to right what you want to match and capture. Repeating capturing blocks won't help you here, because only the most recently captured values would be returned as a result.
\\033\[(\d+)m\\033\[(\d+)m
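A sketch of using this pattern both to inspect the two groups and to fuse each pair. It assumes the text contains the literal characters \033, and the names pair_re, pairs and fused are illustrative. Note that it handles exactly two adjacent chunks and does not deduplicate identical codes:

```python
import re

text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"
pair_re = re.compile(r'\\033\[(\d+)m\\033\[(\d+)m')

# Each match yields the two numeric codes as (group 1, group 2)
pairs = pair_re.findall(text)

# Rebuild each pair as a single chunk with the codes joined by ';'
fused = pair_re.sub(
    lambda m: r'\033[' + m.group(1) + ';' + m.group(2) + 'm', text)

print(pairs)
print(fused)
```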

How to extract only Numbers from the value $1,632.50 (BigQuery)

I would like to extract only numbers before the decimal point.
for example -> $1,632.50
I would like it to return 1632.
the current regex I have (r'[0-9]+') doesn't fetch the correct value if there is a comma associated with the value.
example --> $1,632.50 it returns 1
but for ---> $500.00 it returns 500
It works fine in this case
I am new to regex. Any help is appreciated
PS: I am currently using BigQuery and I only have REGEXP_EXTRACT and REGEXP_REPLACE to work with.
Most of the solutions here work on a normal python script but I still can't get it to work on BigQuery
Below is for BigQuery Standard SQL
REGEXP_REPLACE(str, r'\..*|[^0-9]', '')
As you can see, a single REGEXP_REPLACE does the work.
You can test and play with it using dummy data as below
#standardSQL
WITH t AS (
SELECT '$1,632.50' AS str UNION ALL
SELECT '$500.00'
)
SELECT
str,
REGEXP_REPLACE(str, r'\..*|[^0-9]', '') AS extracted_number
FROM t
with result
Row  str        extracted_number
1    $1,632.50  1632
2    $500.00    500
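The pattern itself can also be sanity-checked outside BigQuery. The sketch below uses Python's re.sub as a stand-in for REGEXP_REPLACE; BigQuery uses the RE2 engine, but this pattern only relies on syntax that both engines share. The function name digits_before_decimal is hypothetical:

```python
import re

def digits_before_decimal(value):
    # First alternative: drop everything from the first '.' onward.
    # Second alternative: drop any remaining non-digit ($ and commas).
    return re.sub(r'\..*|[^0-9]', '', value)

for s in ['$1,632.50', '$500.00']:
    print(s, '->', digits_before_decimal(s))
```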
Your regex matches the first group of digits. It stops at comma. Seems difficult to do that only with one regex.
So search for digits and comma, then replace comma by nothing using str.replace, convert to integer:
import re
s = "$1,632.50"
result = int(re.search(r"([\d,]+)", s).group(1).replace(",", ""))
(doesn't work for $.50, but you can use other tricks, like for instance replace $ by $0 before starting to make sure that there's a 0 after the $)
I think the simplest solution is just to use re.sub.
Example:
import re
result = re.sub(r'[^\d.]', '', '$1,234.56')
This removes every character that is neither a digit nor a ., leaving just the number including its decimal part.
Your regex [0-9]+ matches 1+ times a digit and will not match a comma. It is also not taking the dollar sign into account.
What you might do is match a dollar sign, capture in a group 1+ digits and an optional part that matches a comma and 1+ digits. Then, from that group replace the comma with an empty string.
\$(\d+(?:,\d+)?)
Explanation
\$ Match $
( Capturing group
\d+ Match 1+ digits
(?:,\d+)? Optional non-capturing group that matches a comma and 1+ digits
) Close Capturing group
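A sketch of the two steps in Python: extract group 1, then strip the comma. The helper name dollars is hypothetical. Since the optional part appears only once, values with more than one comma (for example $1,234,567.00) would need (?:,\d+)* instead:

```python
import re

def dollars(value):
    """Return the digits between '$' and the decimal point, or None."""
    m = re.search(r'\$(\d+(?:,\d+)?)', value)
    if m is None:
        return None
    # Group 1 is e.g. '1,632'; remove the thousands separator
    return m.group(1).replace(',', '')

print(dollars('$1,632.50'))
print(dollars('$500.00'))
```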
In BigQuery, you can combine the two functions:
select regexp_replace(regexp_extract(str, '[^.]+'), '[^0-9]', '')
from (select '$1,632.50' as str) x
This seems to work pretty well: r'(\d{,3})?[.,]?(\d{3})?'. Testing it out:
import re
pattern = r'(\d{,3})?[.,]?(\d{3})?'
tests = ['1,234.50',
         '456.7',
         '12']
for t in tests:
    print(''.join([g for g in re.match(pattern, t).groups() if g is not None]))
# 1234
# 456
# 12
Unfortunately you run into an issue with repeated groupings. It appears that the re package does not support repeating subgroup captures. In those cases, you should probably use a string replace.
Breaking down the regex:
pattern = r""" ( # begin first capture group
\d{,3} # up to three digits
) # end capture group
? # zero or one of these first groups of digits
[.,]? # zero or one period or comma (not captured)
( # begin second capture group
\d{3} # exactly three digits
) # end capture group
? # zero or one of these
""" # the whitespace and comments require compiling with re.VERBOSE
You could probably simplify this a bit, but the big thing is you capture each group of three digits (treating the first differently because it can be up to three) separated by optional commas. To put them all together, simply use ''.join([g for g in re.match(pattern, my_string).groups() if g is not None]) as in the example code.
One way to do this in Python without regexp is to extract the part of the string that falls between the dollar sign and decimal, then use replace to remove any commas found inside.
s = "My price is: $1,632.50"
extracted = s[s.find('$')+1:s.find('.')].replace(',', '')
print(extracted)
Here's the same sort of thing with a regexp:
import re
# Look for the first dollar sign followed by any mix of digits and
# commas, and stop at the first character (if any) after that
# which isn't a comma or digit. So both "$1,234.50!" and "$1,234!"
# for example should give back "1234".
result = re.search(r"(\$)([\d,]+)([^,\d]*)", s)
print(re.sub(',', '', result.group(2)))
re.sub here isn't much different than using a string .replace, but it's technically a way to do it using "only" regexps.

Check in python if self designed pattern matches

I have a pattern which looks like:
abc*_def(##)
and I want to check whether it matches some strings.
E.g. it matches:
abc1_def23
abc10_def99
but does not match for:
abc9_def9
So the * stands for a number which can have one or more digits.
The # stands for a number with one digit
I want the value in the parenthesis as result
What would be the easiest and simplest solution for this problem?
Replace the * and # with regex expressions and then check whether they match?
Like this:
pattern = pattern.replace('*', '[0-9]*')
pattern = pattern.replace('#', '[0-9]')
pattern = '^' + pattern + '$'
Or program it myself?
Based on your requirements, I would go for a regex for the simple reason it's already available and tested, so it's easiest as you were asking.
The only "complicated" thing in your requirements is avoiding after def the same digit you have after abc.
This can be done with a negative lookahead that contains a backreference. The regex you can use is:
\babc(\d+)_def((?!\1)\d{1,2})\b
\b matches a word boundary; if you enclose your regex between two \b you restrict your search to words, i.e. text delimited by spaces, punctuation, etc.
abc matches the string abc
\d+ matches one or more digits; if there is an upper limit to the number of digits you want, it has to be \d{1,MAX} where MAX is your maximum number of digits; \d stands for a digit and + indicates 1 or more repetitions
(\d+) is a group: the parentheses define \d+ as something you want to "remember" inside your regex; it's somewhat similar to defining a variable; in this case, (\d+) is your first group since you defined no other groups before it (i.e. to its left)
_def matches the string _def
(?!\1) is the part where you say "I don't want to repeat the first group after _def". \1 represents the first group, while (?!whatever) is a check that succeeds if what follows the current position is NOT (the negation is given by !) whatever you want to negate.
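A sketch of this regex applied to the three strings from the question; the names rx and def_digits are illustrative:

```python
import re

rx = re.compile(r'\babc(\d+)_def((?!\1)\d{1,2})\b')

def def_digits(s):
    """Return the digits after _def, or None if the string is rejected."""
    m = rx.search(s)
    return m.group(2) if m else None

for s in ['abc1_def23', 'abc10_def99', 'abc9_def9']:
    print(s, '->', def_digits(s))
```

Note that the lookahead rejects any suffix that merely starts with group 1, so abc1_def12 would be rejected as well.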
Live demo here.
I had the hardest time getting this to work. The trick was the $
import re
yourlist = ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc955_def9', 'abc_def9', 'abc9_def9288', 'abc49_def9234']
for item in yourlist:
    # the trailing $ forces exactly two digits after _def
    if re.search(r'abc[0-9]+_def[0-9][0-9]$', item):
        print(item, 'is a match')
You could match your pattern like:
abc\d+_def(\d{2})
abc Match literally
\d+ Match 1 or more digits
_ Match underscore
def - Match literally
( Capturing group (Your 2 digits will be in this group)
\d{2} Match 2 digits
) Close capturing group
Then you could for example use search to check for a match and use .group(1) to get the digits between parenthesis.
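As a sketch of how the self-designed pattern from the question could be translated automatically, the hypothetical helper below (template_to_regex is not an existing function) maps * to \d+, # to \d, keeps the parentheses as a capture group, and escapes everything else:

```python
import re

def template_to_regex(template):
    """Translate a template such as 'abc*_def(##)' into a compiled regex."""
    parts = []
    for ch in template:
        if ch == '*':
            parts.append(r'\d+')   # one or more digits
        elif ch == '#':
            parts.append(r'\d')    # exactly one digit
        elif ch in '()':
            parts.append(ch)       # keep as a capture group
        else:
            parts.append(re.escape(ch))
    return re.compile(''.join(parts))

rx = template_to_regex('abc*_def(##)')
for s in ['abc1_def23', 'abc10_def99', 'abc9_def9']:
    m = rx.fullmatch(s)            # the whole string must match
    print(s, '->', m.group(1) if m else None)
```

Using fullmatch plays the role of the ^ and $ anchors in the question's own attempt.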
You could also add word boundaries:
\babc\d+_def(\d{2})\b
