How to extract only Numbers from the value $1,632.50 (BigQuery) - python

I would like to extract only numbers before the decimal point.
for example -> $1,632.50
I would like it to return 1632.
the current regex I have (r'[0-9]+') doesn't fetch the correct value if there is a comma associated with the value.
example --> $1,632.50 it returns 1
but for ---> $500.00 it returns 500
It works fine in this case
I am new to regex. Any help is appreciated
PS: I am currently using BigQuery, and
I only have REGEXP_EXTRACT and REGEXP_REPLACE to work with.
Most of the solutions here work on a normal python script but I still can't get it to work on BigQuery

Below is for BigQuery Standard SQL
REGEXP_REPLACE(str, r'\..*|[^0-9]', '')
As you can see, a single REGEXP_REPLACE does the job.
You can test and play with it using dummy data as below:
#standardSQL
WITH t AS (
SELECT '$1,632.50' AS str UNION ALL
SELECT '$500.00'
)
SELECT
str,
REGEXP_REPLACE(str, r'\..*|[^0-9]', '') AS extracted_number
FROM t
with result
Row str extracted_number
1 $1,632.50 1632
2 $500.00 500
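To sanity-check the pattern outside BigQuery, here is a minimal Python sketch. It assumes Python's re.sub behaves like REGEXP_REPLACE for this simple pattern; the helper name integer_part is made up for illustration.

```python
import re

def integer_part(value):
    # Remove everything from the first period onward, plus any
    # remaining character that is not a digit.
    return re.sub(r'\..*|[^0-9]', '', value)

for s in ['$1,632.50', '$500.00']:
    print(s, '->', integer_part(s))
```

Both alternatives are handled in one pass: the decimal part is dropped by the first branch, and the dollar sign and commas by the second.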

Your regex matches the first group of digits. It stops at comma. Seems difficult to do that only with one regex.
So search for digits and comma, then replace comma by nothing using str.replace, convert to integer:
import re
s = "$1,632.50"
result = int(re.search(r"([\d,]+)", s).group(1).replace(",", ""))
(doesn't work for $.50, but you can use other tricks, like for instance replace $ by $0 before starting to make sure that there's a 0 after the $)

I think the simplest solution is just to use re.sub.
Example:
import re
result = re.sub(r'[^\d.]', '', '$1,234.56')
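This removes every character that is not a digit or a period, leaving 1234.56. To get only the part before the decimal point, you still need to drop everything from the period on (for example with int(float(result))).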

Your regex [0-9]+ matches 1+ times a digit and will not match a comma. It is also not taking the dollar sign into account.
What you might do is match a dollar sign, capture in a group 1+ digits and an optional part that matches a comma and 1+ digits. Then, from that group replace the comma with an empty string.
\$(\d+(?:,\d+)?)
Explanation
\$ Match $
( Capturing group
\d+ Match 1+ digits
(?:,\d+)? Optional non-capturing group that matches a comma and 1+ digits
) Close Capturing group
Regex demo
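A minimal Python sketch of this approach, assuming the value always starts with a dollar sign; the function name whole_dollars is hypothetical.

```python
import re

def whole_dollars(text):
    # Capture the digits (with at most one comma group) right after the $.
    match = re.search(r'\$(\d+(?:,\d+)?)', text)
    if match is None:
        return None
    # Strip the comma from the captured group and convert to int.
    return int(match.group(1).replace(',', ''))

print(whole_dollars('$1,632.50'))
print(whole_dollars('$500.00'))
```

Because the comma part is optional rather than repeated, this only covers one thousands separator; for larger amounts such as $1,234,567.00 the ? would presumably need to become *.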

In BigQuery, you can combine the two functions:
select regexp_replace(regexp_extract(str, '[^.]+'), '[^0-9]', '')
from (select '$1,632.50' as str) x
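The same two-step idea can be tried in Python as a rough equivalent. This is only a sketch: re.search and re.sub stand in for REGEXP_EXTRACT and REGEXP_REPLACE, and the function name extract_then_clean is made up.

```python
import re

def extract_then_clean(value):
    # Step 1: keep everything up to (not including) the first period.
    head = re.search(r'[^.]+', value)
    if head is None:
        return None
    # Step 2: remove every character that is not a digit.
    return re.sub(r'[^0-9]', '', head.group())

print(extract_then_clean('$1,632.50'))
print(extract_then_clean('$500.00'))
```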

This seems to work pretty well: r'(\d{,3})?[.,]?(\d{3})?'. Testing it out:
import re
pattern = r'(\d{,3})?[.,]?(\d{3})?'
tests = ['1,234.50',
'456.7',
'12']
for t in tests:
    print(''.join([g for g in re.match(pattern, t).groups() if g is not None]))
# 1234
# 456
# 12
Unfortunately you run into an issue with repeated groupings: when a capture group is repeated, the re package keeps only its last repetition, so this pattern handles at most one thousands separator (for '1,234,567' it gives 1234). In those cases, you should probably use a string replace.
Breaking down the regex:
pattern = r""" ( # begin capture group
\d{,3} # up to three digits
) # end capture group
? # zero or one of these first groups of digits
[.,]? # zero or one period or comma (not captured)
( # begin second capture group
\d{3} # exactly three digits
) # end capture group
? # zero or one of these
"""
# pass re.VERBOSE when using this form so whitespace and comments are ignored
You could probably simplify this a bit, but the big thing is you capture each group of three digits (treating the first differently because it can be up to three) separated by optional commas. To put them all together, simply use ''.join([g for g in re.match(pattern, my_string).groups() if g is not None]) as in the example code.

One way to do this in Python without regexp is to extract the part of the string that falls between the dollar sign and decimal, then use replace to remove any commas found inside.
s = "My price is: $1,632.50"
extracted = s[s.find('$')+1:s.find('.')].replace(',', '')
print(extracted)
Here's the same sort of thing with a regexp:
import re
# Look for the first dollar sign, followed by any mix of digits and
# commas, and stop at the first character after that (if any)
# which isn't a comma or digit. So both "$1,234.50!" and "$1,234!"
# for example should give back "1234".
result = re.search(r"(\$)([\d,]+)([^,\d]*)", s)
print(re.sub(',', '', result.group(2)))
re.sub here isn't much different than using a string .replace, but it's technically a way to do it using "only" regexps.


Regex python - find match items on list that have the same digit between the second character "_" to character "."

I have the following list :
list_paths=imgs/foldeer/img_ABC_21389_1.tif.tif,
imgs/foldeer/img_ABC_15431_10.tif.tif,
imgs/foldeer/img_GHC_561321_2.tif.tif,
imgs_foldeer/img_BCL_871125_21.tif.tif,
...
I want to be able to run a for loop to match the string with a specific number, which is the number between the second occurrence of "_" and the ".tif.tif". For example, when the number is 1, the string to be matched is "imgs/foldeer/img_ABC_21389_1.tif.tif"; for number 2, the matched string will be "imgs/foldeer/img_GHC_561321_2.tif.tif".
For that, I wanted to use regex expression. Based on this answer, I have tested this regex expression on Regex101:
[^\r\n_]+\.[^\r\n_]+\_([0-9])
But this doesn't match anything, and also doesn't make sure that it will take the exact number, so if number is 1, it might also select items with number 10 .
My end goal is to match the items in the list that have the requested number between the 2nd occurrence of "_" and the first occurrence of ".tif", using a regex. I am looking for help with the regex expression.
EDIT: The output should be the whole path and not only the number.
Your pattern [^\r\n_]+\.[^\r\n_]+\_([0-9]) does not match anything, because you are matching an underscore \_ (note that you don't have to escape it) after matching a dot, and that does not occur in the example data.
Then you want to match a digit, but the available digits only occur before any of the dots.
In your question, the numbers that you are referring to are after the 3rd occurrence of the _
What you could do to get the path(s) is to build the pattern with the number you want to find as a variable. With \d+ as a placeholder for that number, the pattern is:
^\S*?/(?:[^\s_/]+_){3}\d+\.tif\b[^\s/]*$
Explanation
\S*? Match optional non whitespace characters, as few as possible
/ Match literally
(?:[^\s_/]+_){3} Match 3 repetitions of 1+ chars other than whitespace, _ or /, each followed by _
\d+ Match 1+ digits
\.tif\b[^\s/]* Match .tif, a word boundary, then optional chars other than whitespace or /
$ End of string
See a regex demo and a Python demo.
Example using a list comprehension to return all paths for the given number:
import re
number = 10
pattern = rf"^\S*?/(?:[^\s_/]+_){{3}}{number}\.tif\b[^\s/]*$"
list_paths = [
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif",
"imgs_foldeer/img_BCL_871125_21.png.tif"
]
res = [lp for lp in list_paths if re.search(pattern, lp)]
print(res)
Output
['imgs/foldeer/img_ABC_15431_10.tif.tif']
I'll show you something working and equally ugly as regex which I hate:
data = ["imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_21389_1.tif.tif",
"imgs/foldeer/img_ABC_15431_10.tif.tif",
"imgs/foldeer/img_GHC_561321_2.tif.tif",
"imgs_foldeer/img_BCL_871125_21.tif.tif"]
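numbers = [int(x.rsplit("_", 1)[-1].split(".")[0]) for x in data]
Splitting once from the right on "_" gives e.g. "1.tif.tif" as the last element
extract that last element
split again, by the dot this time, and take the first element (that's your number as a string), then cast it to int
Splitting from the right matters because the last path ("imgs_foldeer/...") has an extra underscore in its directory name.
Please keep in mind it's going to work only for the format you provided; there is no flexibility at all in this solution (on the other hand, a regex doesn't give much either)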
Mostly without regex, if that is allowed (string slicing first, then a small regex for the digits):
import re
s= 'imgs/foldeer/img_ABC_15431_10.tif.tif'
last =s[s.rindex('_')+1:]
print(re.findall(r'\d+', last)[0])
Gives:
10
[0-9]*(?=\.tif\.tif)
This regex uses a lookahead to match the last run of digits, the one directly before .tif.tif (what you're looking for). Using [0-9]+ instead of [0-9]* avoids empty matches.
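A short sketch of how the lookahead could be used to filter the list and get whole paths back. It uses + rather than * (an adjustment on my part, so that an empty match cannot occur), and the names paths and trailing_number are illustrative.

```python
import re

paths = [
    "imgs/foldeer/img_ABC_21389_1.tif.tif",
    "imgs/foldeer/img_ABC_15431_10.tif.tif",
    "imgs/foldeer/img_GHC_561321_2.tif.tif",
    "imgs_foldeer/img_BCL_871125_21.tif.tif",
]

def trailing_number(path):
    # Digits that are immediately followed by ".tif.tif".
    m = re.search(r'[0-9]+(?=\.tif\.tif)', path)
    return int(m.group()) if m else None

wanted = 1
print([p for p in paths if trailing_number(p) == wanted])
```

Comparing the extracted number as an integer means 1 does not accidentally select 10 or 21.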
Try this:
import re
s = '''imgs/foldeer/img_ABC_21389_1.tif.tif
imgs/foldeer/img_ABC_15431_10.tif.tif
imgs/foldeer/img_GHC_561321_2.tif.tif
imgs_foldeer/img_BCL_871125_21.tif.tif'''
number = 1
res1 = re.findall(rf".*_{number}\.tif.*", s)
number = 21
res21 = re.findall(rf".*_{number}\.tif.*", s)
print(res1)
print(res21)
Results
['imgs/foldeer/img_ABC_21389_1.tif.tif']
['imgs_foldeer/img_BCL_871125_21.tif.tif']

Regex to find N characters between underscore and period

I have a filename having numerals like test_20200331_2020041612345678.csv.
So I just want to read only first 8 characters from the number between last underscore and .csv using a regex.
For e.g: From the file name test_20200331_2020041612345678.csv --> i want to read only 20200416 using regex.
Regex tried: (?<=_)(\d+)(?=\.)
But it is returning the full number between underscore and period i.e 2020041612345678
Also, when I tried a quantifier like (?<=_)(\d{8})(?=\.) it did not match anything.
The (?<=_)(\d{8})(?=\.) does not work because the (?=\.) positive lookahead requires the presence of a . char immediately to the right of the current location, i.e. right after the eighth digit, but there are more digits in between.
You may add \d* before \. to match any amount of digits after the required 8 digits, use
(?<=_)\d{8}(?=\d*\.)
Or, with a capturing group, you do not even need lookarounds (just make sure you access Group 1 when a match is obtained):
_(\d{8})\d*\.
See the regex demo
Python demo:
import re
s = "test_20200331_2020041612345678.csv"
m = re.search(r"(?<=_)\d{8}(?=\d*\.)", s)
# m = re.search(r"_(\d{8})\d*\.", s) # capturing group approach
if m:
    print(m.group()) # => 20200416
    # print(m.group(1)) # capturing group approach

How to make this Regex pattern work for both strings

I have the strings 'amount $165' and 'amount on 04/20' (and a few other variations I have no issues with so far). I want to be able to run an expression and return the numerical amount IF available (in the first string it is 165) and return nothing if it is not available AND make sure not to confuse with a date (second string). If I write the code as following, it returns the 165 but it also returns 04 from the second.
amount_search = re.findall(r'amount.*?(\d+)[^\d?/]?', string)
If I write it as following, it includes neither
amount_search = re.findall(r'amount.*?(\d+)[^\d?/]', string)
How to change what I have to return 165 but not 04?
To capture the whole number in a group, you could match amount followed by matching all chars except digits or newlines if the value can not cross newline boundaries.
Capture the first encountered digits in a group and assert a whitespace boundary at the right.
\bamount [^\d\r\n]*(\d+)(?!\S)
In parts
\bamount Match amount followed by a space and preceded with a word boundary
[^\d\r\n]* Match 0 or more times any char except a digit or newlines
(\d+) Capture group 1, match 1 or more digits
(?!\S) Assert a whitespace boundary on the right
Regex demo
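A quick check of this pattern against the two strings from the question, as a minimal sketch (the constant name AMOUNT_RE is made up):

```python
import re

AMOUNT_RE = re.compile(r'\bamount [^\d\r\n]*(\d+)(?!\S)')

for text in ['amount $165', 'amount on 04/20']:
    # findall returns the captured digits, or an empty list when the
    # digits are followed by a non-whitespace char such as "/".
    print(text, '->', AMOUNT_RE.findall(text))
```

The date is rejected because the negative lookahead (?!\S) fails on the / that follows 04.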
Try this: ^amount\W*\$([\d]{1,})$
The final $ anchors the match at the end of the line; from what I have tested, using .* or ? there also works.
By requiring the dollar sign and grouping only the digits, the date format with / is excluded.
Hope this helps :)
Try this:
from re import sub
your_digit_list = [int(sub(r'[^0-9]', '', s)) for s in string.split() if s.lstrip('$').isdigit()]

Capturing repeated pattern in Python

I'm trying to implement some kind of markdown like behavior for a Python log formatter.
Let's take this string as example:
**This is a warning**: Virus manager __failed__
A few regexes later the string has lost the markdown like syntax and been turned into bash code:
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
But that should be compressed to
\033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
I tried these, beside many other non working solutions:
(\\033\[([\d]+)m){2,} => Capture: \033[33m\033[1m with g1 '\033[1m' and g2 '1' and \033[0m\033[0mwith g1 '\033[0m' and g2 '0'
(\\033\[([\d]+)m)+ many results, not ok
(?:(\\033\[([\d]+)m)+) many results, although this is the recommended way for repeated patterns if I understood correctly, not ok
and others..
My goal is to have as results:
Input
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
Output
Match 1
\033[33m\033[1m
Group1: 33
Group2: 1
Match 2
\033[0m\033[0m
Group1: 0
Group2: 0
In other words, capture the ones that are "duplicated" and not the ones alone, so I can fuse them with a regex sub.
You want to match consecutively repeating \033[\d+m chunks of text and join the numbers after [ with a semicolon.
You may use
re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(dict.fromkeys(re.findall(r"\[(\d+)", m.group())))+'m', text)
See the Python demo online
The (?:\\033\[\d+m){2,} pattern will match two or more sequences of \033[ + one or more digits + m chunks of texts and then, the match will be passed to the lambda expression, where the output will be: 1) \033[, 2) all the numbers after [ extracted with re.findall(r"\[(\d+)", m.group()) and deduplicated with dict.fromkeys (which, unlike a set, keeps the codes in their original order, so 33 and 1 always become 33;1), and then 3) m.
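Spelled out as a small function, the substitution could look like the sketch below. The names merge_ansi_codes and join_codes are illustrative, and dict.fromkeys is used for deduplication on the assumption that the original order of the codes should be kept.

```python
import re

def merge_ansi_codes(text):
    def join_codes(m):
        # Collect the numbers after each "[" in the matched run.
        codes = re.findall(r'\[(\d+)', m.group())
        # Deduplicate while preserving first-seen order.
        unique = list(dict.fromkeys(codes))
        return r'\033[' + ';'.join(unique) + 'm'
    # Only runs of two or more consecutive escape chunks are rewritten.
    return re.sub(r'(?:\\033\[\d+m){2,}', join_codes, text)

raw = r'\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m'
print(merge_ansi_codes(raw))
```

Single escape chunks such as \033[4m are left alone because the {2,} quantifier requires at least two in a row.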
The patterns in the string to be modified have not been made clear from the question. For example, is 033 fixed or might it be 025 or even 25? I've made certain assumptions in using the regex
^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)
to obtain two capture groups that are to be combined, separated by a semi-colon. I've attempted to make clear my assumptions below, in part to help the OP modify this regex to satisfy alternative requirements.
Demo
The regex performs the following operations:
^ # match beginning of line
( # begin cap grp 1
\\0 # match '\0'
(\d+) # match 1+ digits in cap grp 2
\[ # match '['
\2 # match contents of cap grp 2
) # end cap grp 1
[a-z] # match a lc letter
\\0 # match '\0'
\2 # match contents of cap grp 2
\[ # match '['
(\d[a-z].+) # match a digit, then lc letter then 1+ chars to the
# end of the line in cap grp 3
As you see, the portion of the string captured in group 1 is
\033[33
I've assumed that the part of this string that is now 033 must be two or more digits beginning with a zero, and the second appearance of a string of digits consists of the same digits after the zero. This is done by capturing the digits following '0' (33) in capture group 2 and then using a back-reference \2.
The next part of the string is to be replaced and therefore is not captured:
m\033[
I've assumed that m must be one lower case letter (or should it be a literal m?), that the backslash and zero are required, and that the following digits must again match the content of capture group 2.
The remainder of the string,
1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
is captured in capture group 3. Here I've assumed it begins with one digit (perhaps it should be \d+) followed by one lower case letter that needn't be the same as the lower case letter matched earlier (though that could be enforced with another capture group). At that point I match the remainder of the line with .+, having given up matching patterns in that part of the string.
One may alternatively have just two capture groups, the capture group that is now #2, becoming #1, and #2 being the part of the string that is to be replaced with a semicolon.
This is pretty straightforward for the cases you describe here; simply write out from left to right what you want to match and capture. Repeating capturing blocks won't help you here, because only the most recently captured values would be returned as a result.
\\033\[(\d+)m\\033\[(\d+)m
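One way this two-group pattern might be used to fuse adjacent pairs, as a sketch only (PAIR_RE and fuse_pairs are made-up names, and the input is assumed to contain the literal characters \033 rather than real escape bytes):

```python
import re

PAIR_RE = re.compile(r'\\033\[(\d+)m\\033\[(\d+)m')

def fuse_pairs(text):
    # Join the two captured codes with a semicolon inside one chunk.
    return PAIR_RE.sub(
        lambda m: r'\033[' + m.group(1) + ';' + m.group(2) + 'm', text
    )

sample = r'\033[33m\033[1mThis is a warning\033[0m'
print(fuse_pairs(sample))
```

Note that this handles exactly two adjacent chunks and does not deduplicate, so a doubled reset \033[0m\033[0m would become \033[0;0m.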

Check in python if self designed pattern matches

I have a pattern which looks like:
abc*_def(##)
and i want to look if this matches for some strings.
E.x. it matches for:
abc1_def23
abc10_def99
but does not match for:
abc9_def9
So the * stands for a number which can have one or more digits.
The # stands for a number with one digit
I want the value in the parenthesis as result
What would be the easiest and simplest solution for this problem?
Replace the * and # through regex expression and then look if they match?
Like this:
pattern = pattern.replace('*', '[0-9]*')
pattern = pattern.replace('#', '[0-9]')
pattern = '^' + pattern + '$'
Or program it myself?
Based on your requirements, I would go for a regex for the simple reason it's already available and tested, so it's easiest as you were asking.
The only "complicated" thing in your requirements is avoiding after def the same digit you have after abc.
This can be done with a negative lookahead that contains a backreference. The regex you can use is:
\babc(\d+)_def((?!\1)\d{1,2})\b
\b matches word boundaries; if you enclose your regex between two \b you will restrict your search to words, i.e. text delimited by spaces, punctuation etc.
abc captures the string abc
\d+ captures one or more digits; if there is an upper limit to the number of digits you want, it has to be \d{1,MAX} where MAX is your maximum number of digits; anyway \d stands for a digit and + indicates 1 or more repetitions
(\d+) is a group: the use of parenthesis defines \d+ as something you want to "remember" inside your regex; it's somehow similar to defining a variable; in this case, (\d+) is your first group since you defined no other groups before it (i.e. to its left)
_def captures the string _def
(?!\1) is the part where you say "I don't want to repeat the first group after _def". \1 represents the first group, while (?!whatever) is a check that succeeds if what follows the current position is NOT (the negation is given by !) whatever you want to negate.
Live demo here.
I had the hardest time getting this to work. The trick was the $
import re
yourlist = ['abc1_def23', 'abc10_def99', 'abc9_def9', 'abc955_def9', 'abc_def9', 'abc9_def9288', 'abc49_def9234']
for item in yourlist:
    if re.search(r'abc[0-9]+_def[0-9][0-9]$', item):
        print(item, 'is a match')
You could match your pattern like:
abc\d+_def(\d{2})
abc Match literally
\d+ Match 1 or more digits
_ Match underscore
def Match literally
( Capturing group (Your 2 digits will be in this group)
\d{2} Match 2 digits
) Close capturing group
Then you could for example use search to check for a match and use .group(1) to get the digits between parenthesis.
Demo Python
You could also add word boundaries:
\babc\d+_def(\d{2})\b
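If the custom pattern syntax from the question (* for one or more digits, # for a single digit, parentheses around the value to return) needs to be translated automatically, a sketch along these lines might work. The function name compile_simple_pattern is hypothetical, and it assumes * means at least one digit, as the question describes.

```python
import re

def compile_simple_pattern(pattern):
    parts = []
    for ch in pattern:
        if ch == '*':
            parts.append(r'\d+')   # one or more digits
        elif ch == '#':
            parts.append(r'\d')    # exactly one digit
        elif ch in '()':
            parts.append(ch)       # keep as a capturing group
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile(''.join(parts))

matcher = compile_simple_pattern('abc*_def(##)')
for candidate in ['abc1_def23', 'abc10_def99', 'abc9_def9']:
    m = matcher.fullmatch(candidate)
    print(candidate, '->', m.group(1) if m else None)
```

Using fullmatch plays the role of the ^ and $ anchors, and escaping the literal characters keeps any regex metacharacters in the custom pattern from being misinterpreted.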
