Extracting a number from an unspaced string in Python - python

I need to extract a number from an unspaced string that has the number in brackets, for example:
"auxiliary[0]"
The only way I can think of is:
def extract_num(s):
    s1 = s.split("[")
    s2 = s1[1].split("]")
    return int(s2[0])
This seems very clumsy. Does anyone know of a better way to do it? (The number is always in "[ ]" brackets.)

You could use a regular expression (with the built-in re module):
import re
bracketed_number = re.compile(r'\[(\d+)\]')
def extract_num(s):
    return int(bracketed_number.search(s).group(1))
The pattern matches a literal [ character, followed by 1 or more digits (the \d escape signifies the digits character group, + means 1 or more), followed by a literal ]. By putting parentheses around the \d+ part, we create a capturing group, which we can extract by calling .group(1) ("get the first capturing group result").
Result:
>>> extract_num("auxiliary[0]")
0
>>> extract_num("foobar[42]")
42
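Note that search returns None when nothing matches, so the function above raises AttributeError for a string without a bracketed number. A minimal sketch of a variant that returns None instead (the name extract_num_or_none is hypothetical, not part of the answer above):

```python
import re

bracketed_number = re.compile(r'\[(\d+)\]')

def extract_num_or_none(s):
    """Return the first bracketed integer in s, or None if there is none."""
    match = bracketed_number.search(s)
    return int(match.group(1)) if match else None

print(extract_num_or_none("auxiliary[0]"))  # 0
print(extract_num_or_none("no brackets"))   # None
```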

I would use a regular expression to get the number. See docs: http://docs.python.org/2/library/re.html
Something like:
import re
def extract_num(s):
    m = re.search(r'\[(\d+)\]', s)
    return int(m.group(1))

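a = "auxiliary[0]"
# Each of these picks one character, so they only work for a single-digit number.
print(a[-2])
print(a[a.index(']') - 1])
print(a[a.index('[') + 1])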

import re
for number in re.findall(r'\[(\d+)\]', "auxiliary[0]"):
    do_sth(number)  # do_sth is a placeholder for your own handling
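The findall approach is handy when a string may hold more than one bracketed number. A small sketch, assuming a made-up input with two indices:

```python
import re

# Hypothetical input with more than one bracketed index.
text = "matrix[3][14]"

# findall returns the captured group of every match, as strings.
numbers = [int(n) for n in re.findall(r'\[(\d+)\]', text)]
print(numbers)  # [3, 14]
```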

Related

Extract Only Digits from Dollar Figures

What I'm trying to do is extract only the digits from dollar figures.
Format of Input
...
$1,289,868
$62,000
$421
...
Desired Output
...
1289868
62000
421
...
The regular expression that I was using to extract only the digits and commas is:
r'\d+(,\d+){0,}'
which of course outputs...
...
1,289,868
62,000
421
...
What I'd like to do is convert the output to an integer (int(...)), but obviously this won't work with the commas. I'm sure I could figure this out on my own, but I'm running really short on time right now.
I know I can simply use r'\d+', but this obviously separates each chunk into separate matches...
You can't match discontinuous texts within one match operation. You can't put a regex into re.findall against 1,345,456 to receive 1345456. You will need to first match the strings you need, and then post-process them within code.
A regex you may use to extract the numbers themselves:
re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)
Alternatively, you may use a bit more general regex to be used with re.findall:
r'\$(\d+(?:,\d+)*)'
Note that re.findall will only return the captured part of the string (the one matched with the (...) part in the regex).
Details
\$ - a dollar sign
(\d{1,3}(?:,\d{3})*) - Capturing group 1:
    \d{1,3} - 1 to 3 digits (if \d+ is used, 1 or more digits)
    (?:,\d{3})* - 0 or more sequences of
        , - a comma
        \d{3} - 3 digits (or if \d+ is used, 1 or more digits).
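To see how the strict and the more general pattern differ, here is a small sketch run against a deliberately malformed figure ("$12,34" is made up for illustration and does not appear in the question):

```python
import re

strict = r'\$(\d{1,3}(?:,\d{3})*)'   # commas must be followed by exactly 3 digits
general = r'\$(\d+(?:,\d+)*)'        # commas may be followed by any number of digits

s = "$1,289,868 and $12,34"  # "$12,34" is a made-up, malformed figure

print(re.findall(strict, s))   # ['1,289,868', '12']
print(re.findall(general, s))  # ['1,289,868', '12,34']
```

The strict pattern stops at the first comma group that is not three digits long, while the general one accepts it.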
Python code sample (with removing commas):
import re
s = """$1,289,868
$62,000
$421"""
result = [x.replace(",", "") for x in re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)]
print(result) # => ['1289868', '62000', '421']
Using re.sub
Ex:
import re
s = """$1,289,868
$62,000
$421"""
print([int(i) for i in re.sub(r'[^0-9\s]', "", s).splitlines()])
Output:
[1289868, 62000, 421]
You don't need regex for this.
int(''.join(filter(str.isdigit, "$1,000,000")))
works just fine.
If you did want to use regex for some reason:
int(''.join(re.findall(r"\d", "$1,000,000")))
If you know how to extract the numbers with comma groupings, the easiest thing to do is just transform that into something int can handle:
for match in matches:
i = int(match.replace(',', ''))
For example, if match is '1,289,868', then match.replace(',', '') is '1289868', and obviously int(<that>) is 1289868.
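Putting the two steps together, a minimal sketch (it borrows the \$(\d{1,3}(?:,\d{3})*) pattern from another answer to produce matches; any pattern that yields the comma-grouped digits would do):

```python
import re

s = "$1,289,868\n$62,000\n$421"

# Step 1: match the comma-grouped digits after each dollar sign.
matches = re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)

# Step 2: strip the commas and convert to int.
amounts = [int(match.replace(',', '')) for match in matches]
print(amounts)  # [1289868, 62000, 421]
```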
You don't need regex for this. Plain string operations should be enough:
>>> string = '$1,289,868\n$62,000\n$421'
>>> [w.lstrip('$').replace(',', '') for w in string.splitlines()]
['1289868', '62000', '421']
Or alternatively, you can use locale.atoi to convert string of digits with commas to int
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF8')
>>> list(map(lambda x: locale.atoi(x.lstrip('$')), string.splitlines()))
[1289868, 62000, 421]

How to return regular expression match as one entire string?

I want to match phone numbers, and return the entire phone number but only the digits. Here's an example:
(555)-555-5555
555.555.5555
But I want to use regular expressions to return only:
5555555555
But, for some reason I can't get the digits to be returned:
import re
phone_number='(555)-555-5555'
regex = re.compile('[0-9]')
r = regex.search(phone_number)
regex.match(phone_number)
print r.groups()
But for some reason it just prints an empty tuple? What is the obvious thing I am missing here? Thanks.
You're getting an empty result because your pattern has no capturing groups; refer to the documentation for details.
If you call group() instead, you'll get the first digit as the match. But this is not what you want, because search returns only the first match: a single digit for '[0-9]', or the first run of digits up to the next non-digit character for '[0-9]+'.
You can simply remove all non-numeric characters:
re.sub('[^0-9]', '', '(555)-555-5555')
The range 0-9 is negated, so the regex matches anything that's not a digit, then it replaces it with the empty string.
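A short sketch applying this substitution to both formats from the question (the helper name digits_only is hypothetical):

```python
import re

def digits_only(phone_number):
    # Replace every character that is not 0-9 with the empty string.
    return re.sub(r'[^0-9]', '', phone_number)

for raw in ['(555)-555-5555', '555.555.5555']:
    print(digits_only(raw))  # 5555555555 both times
```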
You can do it without a regular expression using str.join and str.isdigit:
s = "(555)-555-5555"
print("".join([ch for ch in s if ch.isdigit()]))
5555555555
If you printed r.group() you would get some output, but search is not the right way to find all the matches: it returns only the first one. Since you are looking for a single digit, it would return 5; even with '[0-9]+' to match one or more digits, you would still only get the first run of consecutive digits, i.e. 555 in the string above. Using "".join(regex.findall(s)) would get all the digits, but that can obviously be done with str.isdigit.
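The difference between search and findall described here can be checked with a few lines (a sketch using the same sample string):

```python
import re

s = "(555)-555-5555"

# search stops at the first match.
print(re.search(r'[0-9]', s).group())     # 5
print(re.search(r'[0-9]+', s).group())    # 555

# findall collects every non-overlapping match.
print(re.findall(r'[0-9]+', s))           # ['555', '555', '5555']
print("".join(re.findall(r'[0-9]+', s)))  # 5555555555
```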
If you knew the potential non-digit chars then str.translate would be the best approach:
s = "(555)-555-5555"
print(s.translate(str.maketrans("", "", "()-.")))  # Python 3; in Python 2: s.translate(None, "()-.")
5555555555
The simplest way is here:
>>> import re
>>> s = "(555)-555-5555"
>>> x = re.sub(r"\D+", r"", s)
>>> x
'5555555555'

How do I properly use the sub function of regular expression in Python?

So I have a number like 7.50x, which I want to convert to 7.5x. I thought about using regular expressions. I can easily match this expression, for example by using re.search('[0-9].[0-9]0x', string). However, I'm confused how to replace every such number using the re.sub method. For example what should be there as the second argument?
re.sub('[0-9].[0-9]0x', ?, string)
re.sub(r'([0-9]\.[0-9])0x', r'\1x', num)
Test
>>> import re
>>> num="7.50x"
>>> re.sub(r'([0-9]\.[0-9])0x', r'\1x', num)
'7.5x'
In r'\1x', \1 is the text saved by the first capturing group, ([0-9]\.[0-9]).
E.g. for the input 7.50x, the capturing group matches 7.5, which is saved in \1.
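Since the question asks about replacing every such number, here is a sketch that runs the same substitution over a longer string (the sample text is made up):

```python
import re

text = "sizes: 7.50x, 2.00x and 3.25x"  # made-up sample text
pattern = r'([0-9]\.[0-9])0x'

# re.sub replaces every non-overlapping match, not just the first.
result = re.sub(pattern, r'\1x', text)
print(result)  # sizes: 7.5x, 2.0x and 3.25x

# \g<1> is an equivalent, unambiguous way to write the same backreference.
assert re.sub(pattern, r'\g<1>x', text) == result
```

Only one trailing zero is dropped per number, so 2.00x becomes 2.0x, and 3.25x is left alone because its second decimal is not 0.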
0+(?![1-9])(?=[^.]*$)
Try this. See the demo:
http://regex101.com/r/hQ9xT1/14
import re
x = "7.50x"
re.sub(r"0+(?![1-9])(?=[^.]*$)", "", x)  # '7.5x'
Using positive lookahead and lookbehind assertion.
>>> import re
>>> num="7.50x"
>>> re.sub(r'(?<=\d\.\d)0(?=x)', r'', num)
'7.5x'
(?<=\d\.\d) asserts that the 0 is preceded by text in digit-dot-digit format.
(?=x) asserts that the character following the matched 0 is x.
\. matches a literal dot.

How to ignore characters with Python Regex

I am wondering if there is a better Python Regex solution for the one that I currently have? Currently my code is:
import re
n = '+17021234567'
m = '7021234567'
match = re.search(r'(?:\+1)?(\d{10})', n)
match.group(1)
match = re.search(r'(?:\+1)?(\d{10})', m)
match.group(1)
The goal of the code is to extract only the 10-digit phone number, whether or not it has a leading +1. Currently it works, but I am wondering: is there a way to just call match.group() to get the 10-digit phone number without calling match.group(1)?
No. Without a capturing group this isn't possible with re.match, since re.match only matches from the beginning of the string. It is possible with re.search, though:
>>> re.search(r'\d{10}$', n).group()
'7021234567'
>>> re.search(r'\d{10}$', m).group()
'7021234567'
If you only want to capture digits, use \d:
import re
n = '+17021234567'
re.findall(r'\d{10}$', n)  # ['7021234567']
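Use this pattern:
(?:^|(?<=\+1))\d{10}$
The version originally posted, (?<=^|\+1)\d{10}$, works in engines such as PCRE, but Python's re module rejects it with "look-behind requires fixed-width pattern" because the two lookbehind alternatives have different widths. Moving ^ out of the lookbehind keeps the same meaning.
(?: group, but do not capture:
^ the beginning of the string
| OR
(?<=\+1) look behind to see if there is '+1'
) end of grouping
\d{10} digits (0-9) (10 times)
$ before an optional \n, and the end of the string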
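A quick check of the lookbehind approach under Python's re, written as (?:^|(?<=\+1))\d{10}$ so that each lookbehind has a fixed width, which Python's re requires. This is a sketch; the third number is a made-up example of input that should not match:

```python
import re

# Ten digits at the end, preceded either by the start of the string or by "+1".
pattern = re.compile(r'(?:^|(?<=\+1))\d{10}$')

for number in ['+17021234567', '7021234567', '27021234567']:
    match = pattern.search(number)
    # group() with no argument returns the whole match: just the ten digits.
    print(number, '->', match.group() if match else None)
```

Unlike a bare \d{10}$, this refuses an 11-digit string that does not start with +1.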

python regex: get end digits from a string

I am quite new to python and regex (regex newbie here), and I have the following simple string:
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
I would like to extract only the last digits in the above string i.e 767980716 and I was wondering how I could achieve this using python regex.
I wanted to do something similar along the lines of:
re.compile(r"""-(.*?)""").search(str(s)).group(1)
indicating that I want to find the stuff in between (.*?) which starts with a "-" and ends at the end of string - but this returns nothing..
I was wondering if anyone could point me in the right direction..
Thanks.
You can use re.match to find only the trailing digits:
>>> import re
>>> s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
>>> re.match('.*?([0-9]+)$', s).group(1)
'767980716'
Alternatively, re.finditer works just as well:
>>> next(re.finditer(r'\d+$', s)).group(0)
'767980716'
Explanation of all regexp components:
.*? is a non-greedy match and consumes only as little as possible (a greedy match would consume everything except for the last digit).
[0-9] and \d are two different ways of matching digits. Note that the latter also matches digits in other writing systems, like ୪ or ൨.
Parentheses (()) make the content of the expression a group, which can be retrieved with group(1) (or 2 for the second group, 0 for the whole match).
+ means one or more repetitions (at least one digit at the end).
$ matches only the end of the input.
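The effect of the non-greedy .*? can be seen by comparing it with the greedy form on the same string (a small sketch):

```python
import re

s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"

# Non-greedy: .*? gives up characters as early as possible,
# so the group captures the whole trailing run of digits.
print(re.match(r'.*?([0-9]+)$', s).group(1))  # 767980716

# Greedy: .* swallows everything it can, leaving only the last digit.
print(re.match(r'.*([0-9]+)$', s).group(1))   # 6
```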
Nice and simple with findall:
import re
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
print(re.findall(r'^.*-([0-9]+)$', s))
# ['767980716']
Regex Explanation:
^ # Match the start of the string
.* # Followed by anything
- # Up to the last hyphen
([0-9]+) # Capture the digits after the hyphen
$ # Up to the end of the string
Or, more simply, just match the digits at the end of the string: '([0-9]+)$'
Your Regex should be (\d+)$.
\d+ is used to match digit (one or more)
$ is used to match at the end of string.
So, your code should be:
>>> s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
>>> import re
>>> re.compile(r'(\d+)$').search(s).group(1)
'767980716'
And you don't need to use str function here, as s is already a string.
Use the below regex
\d+$
$ denotes the end of the string
\d is a digit
+ matches the preceding character 1 to many times
Save the regular expressions for something that requires more heavy lifting.
>>> def parse_last_digits(line): return line.split('-')[-1]
>>> s = parse_last_digits(r"99-my-name-is-John-Smith-6376827-%^-1-2-767980716")
>>> s
'767980716'
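One caveat: split('-')[-1] returns whatever follows the last hyphen, whether or not it is numeric. A sketch of a slightly stricter variant, assuming you want None when the tail is not all digits (that behaviour is an assumption, not part of the answer above):

```python
def parse_last_digits(line):
    # rpartition splits on the last '-' and returns (head, sep, tail);
    # if there is no '-', tail is the whole line.
    tail = line.rpartition('-')[2]
    return tail if tail.isdigit() else None

print(parse_last_digits("99-my-name-is-John-Smith-6376827-%^-1-2-767980716"))  # 767980716
print(parse_last_digits("no-digits-here"))                                     # None
```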
I have been playing around with several of these solutions, but many seem to fail if there are no numeric digits at the end of the string. The following code should work.
import re
W = input("Enter a string:")
# Match once and reuse the result; re.match returns None when there are no trailing digits.
m = re.match('.*?([0-9]+)$', W)
if m is None:
    last_digits = "None"
else:
    last_digits = m.group(1)
print("Last digits of "+W+" are "+last_digits)
Try using \d+$ instead. That matches one or more numeric characters followed by the end of the string.
