I'm trying to extract multiple numbers, even when they are on the same line.
This is what I have so far:
import re
lines = '''
1 2
3
4
G3434343 '''
x = re.findall('^\d{1,2}',lines,re.MULTILINE)
print(x)
the output I am getting is:
[1,3,4]
The output I want to get is:
[1,2,3,4]
Not sure what to try next. Any ideas? Keep in mind the numbers can also be two digits. Here's another input example:
8 9
11
15
G54444
the output for the above should be
[8,9,11,15]
You use the ^ which anchors the regex to the beginning of the line. Remove that to find e.g., the "2".
Then you'll also get the "34"s from the "G3434343" which you don't seem to want. So you need to tell the regex that you need to have a word boundary in front and after the digit(s) by using \b. (Adjust this as needed, it's not clear from your examples how you decide which digits you want.)
However, that \b gets interpreted as a backspace, and nothing matches. So you need to change the string to a raw string with a leading 'r' or escape all backslashes with another backslash.
import re
lines = '''
1 2
3
4
G3434343 '''
x = re.findall(r'\b\d{1,2}\b', lines, re.MULTILINE)
print(x)
This prints:
['1', '2', '3', '4']
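The backspace issue described above can be checked directly. This is a minimal sketch; the variable names are only illustrative and not part of the original answer:

```python
import re

lines = '''
1 2
3
4
G3434343 '''

# In a normal string literal, '\b' is the single backspace character (0x08).
plain_pattern = '\b\\d{1,2}\b'
# In a raw string, r'\b' stays backslash + 'b', which the regex engine
# interprets as a word boundary.
raw_pattern = r'\b\d{1,2}\b'

backspace_matches = re.findall(plain_pattern, lines)
boundary_matches = re.findall(raw_pattern, lines)

print('\b' == '\x08')     # True
print(backspace_matches)  # []
print(boundary_matches)   # ['1', '2', '3', '4']
```

The digits in G3434343 are skipped because there is no word boundary between a letter and a digit.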
You can try with lookarounds:
(?<=\s)\d{1,2}(?=\s)
Then map every match in each input to an integer (assuming lines is a list of the input strings):
[list(map(int, re.findall(r'(?<=\s)\d{1,2}(?=\s)', line))) for line in lines]
Output:
[[1,2,3,4], [8,9,11,15]]
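A self-contained sketch of the lookaround approach, assuming the two sample inputs are stored as strings in a list (the names inputs, pattern and edge_pattern are illustrative). The second pattern is a possible variant, not part of the original answer, for numbers that sit at the very start or end of the text:

```python
import re

# The two sample inputs from the question, collected in a list
# (an assumption about how the data is stored).
inputs = [
    "\n1 2\n3\n4\nG3434343 ",
    "\n8 9\n11\n15\nG54444\n",
]

# Each number must have a whitespace character on both sides.
pattern = re.compile(r'(?<=\s)\d{1,2}(?=\s)')
numbers = [[int(m) for m in pattern.findall(text)] for text in inputs]
print(numbers)  # [[1, 2, 3, 4], [8, 9, 11, 15]]

# Variant: "no non-space character before/after" also accepts a number at
# the start or end of the text, where there is no whitespace to look at.
edge_pattern = re.compile(r'(?<!\S)\d{1,2}(?!\S)')
edge_numbers = [int(m) for m in edge_pattern.findall("8 9")]
print(edge_numbers)  # [8, 9]
```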
I always have two digits in the middle, and I want to extract everything before the 3 (Salvatore) and everything after the 2 (Abdulla).
For example I have the following:
txt = "Salvatore32Abdulla"
first = re.findall("^\D+", txt)
last = re.search(,txt)
Expected result:
first = 'Salvatore'
last = 'Abdulla'
I can get the first part, but after 2 I can't get the last part
You could also do this in a single line by slightly changing the solution suggested by @ctwheels, as follows. I would suggest using re.findall, as it gets the job done in one go.
import re
txt = "Salvatore32Abdulla"
Option-1
Single line extraction of the non-numeric parts.
first, last = re.findall(r"\D+", txt)
print((first, last))
('Salvatore', 'Abdulla')
Option-2
If you would (for some reason) also want to keep track of the number in between:
first, num, last = re.findall(r"(\D+)(\d{2})(\D+)", txt)[0]  # findall returns a list of tuples; take the first match
print((first, num, last))
('Salvatore', '32', 'Abdulla')
Option-3
As an extension of Option-2: if the text has the form 'Salvatore####...###Abdulla', where ####...### denotes a continuous block of digits separating the non-numeric parts and you do not know in advance how many digits there are, you could use the following (re.findall returns a list of tuples when the pattern has several groups, so take the first element):
first, num, last = re.findall(r"(\D+)(\d+)(\D+)", txt)[0]
print((first, num, last))
('Salvatore', '32', 'Abdulla')
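One detail worth checking for Option-2 and Option-3: when a pattern has several capturing groups, re.findall returns a list with one tuple per match, not a flat list. A short sketch of what comes back and how to unpack it:

```python
import re

txt = "Salvatore32Abdulla"

# With three capturing groups, findall yields one 3-tuple per match.
matches = re.findall(r"(\D+)(\d+)(\D+)", txt)
print(matches)  # [('Salvatore', '32', 'Abdulla')]

# Unpack the first (and here only) match.
first, num, last = matches[0]
print((first, num, last))  # ('Salvatore', '32', 'Abdulla')
```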
Why am I not getting the expected results?
You currently have one issue with your regex and one with your code.
Your regex contains ^, which anchors it to the start of the string. This will only allow you to match Salvatore. You're using findall (which is the appropriate choice if you change the regex to simply \D+), but right now it's only getting one result.
The second re.search call is not needed as you can capture first and last with the findall given an appropriate pattern (see below).
How do I fix it?
import re
txt = "Salvatore32Abdulla"
x = re.findall(r"\D+", txt)
print(x)
Result:
['Salvatore', 'Abdulla']
You could use a regex like this:
txt = "Salvatore32Abdulla"
regex = r"(\D+)\d\d(\D+)"
match = re.match(regex, txt)
first = match.group(1)
last = match.group(2)
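Since re.match returns None when the text does not fit the pattern, calling match.group(...) directly can raise an AttributeError. A minimal sketch with a guard; the helper name split_around_digits is hypothetical:

```python
import re

def split_around_digits(text):
    """Return (first, last) around a two-digit block, or None if there is no match."""
    match = re.match(r"(\D+)\d\d(\D+)", text)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(split_around_digits("Salvatore32Abdulla"))  # ('Salvatore', 'Abdulla')
print(split_around_digits("NoDigitsHere"))        # None
```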
Part after last digit:
match = re.search(r'\D+$',txt)
if match:
    print(match.group())
Results: Abdulla
EXPLANATION
\D+    non-digits (all but 0-9), 1 or more times (matching the most amount possible)
$      before an optional \n, and the end of the string
Stuck with the following issue:
I have a string 'ABC.123.456XX' and I want to use regex to extract the 3 numeric characters that come after the second period. I'm really struggling with this and would appreciate any new insights. This is the closest I got, but it's not really close to what I want:
'.*\.(.*?\.\d{3})'
I appreciate any help in advance - thanks.
If your input will always be in a similar format, like xxx.xxx.xxxxx, then one solution is string manipulation:
>>> s = 'ABC.123.456XX'
>>> '.'.join(s.split('.')[2:])[0:3]
Explanation
In the line '.'.join(s.split('.')[2:])[0:3]:
s.split('.') splits the string into the list ['ABC', '123', '456XX']
'.'.join(s.split('.')[2:]) joins the remainder of the list after the second element, so '456XX'
[0:3] selects the substring from index 0 to index 2 (inclusive), so the result is 456
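A slightly shorter variant of the same idea, as a sketch: str.split accepts a maximum number of splits, so the join step can be skipped. Like the original, it assumes the input always has at least two periods and does not check that the three characters are digits.

```python
s = 'ABC.123.456XX'

# Split at most twice, so everything after the second period stays in one piece.
parts = s.split('.', 2)
print(parts)  # ['ABC', '123', '456XX']

# Take the first three characters of that last piece.
digits = parts[2][:3]
print(digits)  # 456
```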
This expression might also work just OK:
[^\r\n.]+\.[^\r\n.]+\.([0-9]{3})
Test
import re
regex = r'[^\r\n.]+\.[^\r\n.]+\.([0-9]{3})'
string = '''
ABC.123.456XX
ABCOUOU.123123123.000871XX
ABCanything_else.123123123.111871XX
'''
print(re.findall(regex, string))
Output
['456', '000', '111']
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.
Match a run of non-dots, then a dot followed by non-dots twice; the 3 digits then follow in capture group 1:
[^.]*(?:\.[^.]*){2}(\d{3})
https://regex101.com/r/qWpfHx/1
Expanded
[^.]*
(?: \. [^.]* ){2}
( \d{3} ) # (1)
What I'm trying to do is extract only the digits from dollar figures.
Format of Input
...
$1,289,868
$62,000
$421
...
Desired Output
...
1289868
62000
421
...
The regular expression that I was using to extract only the digits and commas is:
r'\d+(,\d+){0,}'
which of course outputs...
...
1,289,868
62,000
421
...
What I'd like to do is convert the output to an integer (int(...)), but obviously this won't work with the commas. I'm sure I could figure this out on my own, but I'm running really short on time right now.
I know I can simply use r'\d+', but this obviously separates each chunk into separate matches...
You can't match discontinuous texts within one match operation. You can't put a regex into re.findall against 1,345,456 to receive 1345456. You will need to first match the strings you need, and then post-process them within code.
A regex you may use to extract the numbers themselves
re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)
Alternatively, you may use a bit more general regex to be used with re.findall:
r'\$(\d+(?:,\d+)*)'
Note that re.findall will only return the captured part of the string (the one matched with the (...) part in the regex).
Details
\$ - a dollar sign
(\d{1,3}(?:,\d{3})*) - capturing group 1:
    \d{1,3} - 1 to 3 digits (if \d+ is used, 1 or more digits)
    (?:,\d{3})* - 0 or more sequences of:
        , - a comma
        \d{3} - 3 digits (or, if \d+ is used, 1 or more digits)
Python code sample (with removing commas):
import re
s = """$1,289,868
$62,000
$421"""
result = [x.replace(",", "") for x in re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)]
print(result) # => ['1289868', '62000', '421']
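Since the question asks for integers rather than strings, the same comprehension can wrap each cleaned match in int(). A short sketch based on the sample above (the name amounts is illustrative):

```python
import re

s = """$1,289,868
$62,000
$421"""

# Capture the digit groups after each dollar sign, drop the commas, convert to int.
amounts = [int(m.replace(",", "")) for m in re.findall(r'\$(\d{1,3}(?:,\d{3})*)', s)]
print(amounts)  # [1289868, 62000, 421]
```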
Using re.sub
Ex:
import re
s = """$1,289,868
$62,000
$421"""
print([int(i) for i in re.sub(r'[^0-9\s]', "", s).splitlines()])
Output:
[1289868, 62000, 421]
You don't need regex for this.
int(''.join(filter(str.isdigit, "$1,000,000")))
works just fine.
If you did want to use regex for some reason:
int(''.join(re.findall(r"\d", "$1,000,000")))
If you know how to extract the numbers with comma groupings, the easiest thing to do is just transform that into something int can handle:
for match in matches:
    i = int(match.replace(',', ''))
For example, if match is '1,289,868', then match.replace(',', '') is '1289868', and obviously int(<that>) is 1289868.
You don't need regex for this. Plain string operations should be enough:
>>> string = '$1,289,868\n$62,000\n$421'
>>> [w.lstrip('$').replace(',', '') for w in string.splitlines()]
['1289868', '62000', '421']
Alternatively, you can use locale.atoi to convert a string of digits with commas to an int:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF8')
>>> list(map(lambda x: locale.atoi(x.lstrip('$')), string.splitlines()))
[1289868, 62000, 421]
I have the following string:
ref:_00D30jPy._50038vQl5C:ref
And I would like to produce the following output string:
5003800000vQl5C
The required regex actions are:
Remove all leading characters until the digit '5'.
Insert 5 zeros after the fifth character.
Remove the closing ':ref'.
I initially made the following regex to match the whole string:
(ref:(\S+):ref)
How can I alter the Python RegEx to achieve the above?
Use re.sub:
import re
s = 'ref:_00D30jPy._50038vQl5C:ref'
result = re.sub(r'^[^5]*(5.{4})(.*?):ref$', r'\g<1>00000\g<2>', s, 0, re.MULTILINE)
print(result)
Output:
5003800000vQl5C
Explanation:
^[^5]*: match characters except 5 from the beginning
(5.{4}): capture the first 5 characters to group 1
(.*?):ref$: capture the remaining to group 2 except the :ref at the end
\g<1>00000\g<2>: replace the whole line with \g<1>00000\g<2>, where \g<1> and \g<2> are substituted by groups 1 and 2 respectively.
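re.sub returns the input unchanged when nothing matches, which can hide bad input. As a sketch of an alternative under that concern, the same pattern can be wrapped in a small function that reports a failed match explicitly; the names REF_PATTERN and expand_ref are hypothetical:

```python
import re

# Same pattern as in the re.sub call: skip to the first '5', capture five
# characters, then capture the rest up to the closing ':ref'.
REF_PATTERN = re.compile(r'^[^5]*(5.{4})(.*?):ref$')

def expand_ref(s):
    """Return the expanded id, or None if s does not have the expected shape."""
    match = REF_PATTERN.match(s)
    if match is None:
        return None
    return f"{match.group(1)}00000{match.group(2)}"

print(expand_ref('ref:_00D30jPy._50038vQl5C:ref'))  # 5003800000vQl5C
print(expand_ref('no match here'))                  # None
```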
Regex is not required for this task. It can be achieved more simply using string slicing.
If the input strings maintain the same format and lengths you can simply do this:
s = 'ref:_00D30jPy._50038vQl5C:ref'
new = '{}00000{}'.format(s[15:20], s[20:-4])
If there is some variability then search for the first '5' in the string and slice from there:
start = s.index('5')
new = '{}00000{}'.format(s[start:start+5], s[start+5:-4])
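Both slicing variants can be checked side by side on the sample string. A quick sketch (the names fixed and searched are illustrative); note that s.index raises a ValueError if the string contains no '5':

```python
s = 'ref:_00D30jPy._50038vQl5C:ref'

# Variant 1: fixed offsets, valid only if the format and lengths never change.
fixed = '{}00000{}'.format(s[15:20], s[20:-4])

# Variant 2: locate the first '5' and slice relative to it.
start = s.index('5')
searched = '{}00000{}'.format(s[start:start + 5], s[start + 5:-4])

print(start)     # 15
print(fixed)     # 5003800000vQl5C
print(searched)  # 5003800000vQl5C
```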
I have this polynomial in a string.
x^3+0.125x+2
I want to match the 3 and the 2 here, but not the 0.125. Just the integers. The best I have come up with so far is this, but it still matches the 25 in 0.125.
(?<!\.)\d+(?!\.)
You can try this:
>>> import re
>>> re.findall(r'(?<!\.)\b\d+\b(?!\.)', "x^3+0.125x+2")
['3', '2']
Use \b\d+\b to make sure the entire number is matched.
An integer is a number that contains only digits, an optional e or E (only if followed by numbers) and optionally starts with a -. To the left there can only be a non-number and non-letter (since x2 would be considered a variable name) or nothing. To the right there can only be a non-number or nothing (2x on the right would be 2*x).
The following pattern should match all integers in a string according to the given specification:
r'(?:^|(?<=[^\d\w\.]))(?:(?:(?<![\d\w])|^)\-)?\d+(?:[eE]\d+)?(?!\.)(?=[^\d]|$)'
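A quick check of this pattern against the polynomial from the question, as a sketch (the name INTEGER is illustrative). Because every group in the pattern is non-capturing, re.findall returns the full matches:

```python
import re

INTEGER = re.compile(
    r'(?:^|(?<=[^\d\w\.]))(?:(?:(?<![\d\w])|^)\-)?\d+(?:[eE]\d+)?(?!\.)(?=[^\d]|$)'
)

found = INTEGER.findall("x^3+0.125x+2")
print(found)  # ['3', '2']
```

The 0 is rejected because it is followed by a period, and the digits of 125 are rejected because each is preceded by a period or another digit.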