regex capture numbers after varied lengths of spaces - python

I try to use a non-capturing group to detect the spaces (before the numbers I needed) and not to bring spaces into my result, so I use
(?: 1+)\d*.?\d*
to process my text:
input: kMPCV/epS4SgFoNdLo3LOuClO/URXS/5 134.686356921 2018-06-14 21:50:35.494
input: pRVh7kPpFbtmuwS1NILiCzwHUVwJ4NcK 839.680408921 2018-06-14 22:13:39.996
input: Ga7MIXmXAsrbaEc1Yj60qYYblcRQpnpz 4859.688276920 2018-06-14 23:02:11.125
input: 4mqdb5njytfDOFpgeG3XS0Iv1OXFPEnb 1400.684675920 2018-06-14 23:33:42.031
and try to get the numbers.
But line 2 and 3 returns None result and line 1 and 4 returns numbers with 1 space before it: " 134.686356921"
Why I get different results? Code is below:
import re
def calcprice(filename):
try:
print ('ok')
f = open(filename, 'r')
data = f.read()
rows = data.split('\n')
for row in rows:
print (re.search("[(?: 1+)\d*\.?\d*][1]",row))
except Exception as e:
print(e)
if __name__ == "__main__": ## If we are not importing this:
calcprice('dfk balance.txt')
Result:
<_sre.SRE_Match object; span=(52, 66), match=' 134.686356921'>
None
None
<_sre.SRE_Match object; span=(51, 66), match=' 1400.684675920'>

Your current regex is basically one big character set:
[(?: 1+)\d*\.?\d*]
which doesn't make much sense, looks like a misunderstanding of how regex works. If you want to match the numbers, it would probably make more sense to lookbehind for a couple spaces, match digits and periods, and lookahead for another couple spaces:
(?<= )[\d.]+(?= )
https://regex101.com/r/NRnXWb/1
for row in rows:
print (re.search(r"(?<= )[\d.]+(?= )",row))

Try the regex \b(\d+[\d\.]*)\b
Your regex doesn't align to what you're trying to do.. It's pretty erroneous.

Try this pattern: +(\d+(\.\d+)?) +.
Explanation: pattern will match number preceeded and followed by one or more spaces (+). It will match numbers with optional decimal part ((\.\d+)?), which will become second capturing group in a match (but you won't need it anyway).
In every match, first capturing group \1 will be your number.
Demo

Your regex [(?: 1+)\d*\.?\d*][1] consists or 2 times a character class.
If the number you want to match always contains a dot, you could use a word boundary and a positive lookahead to assert that what followes is a whitespace:
\b\d+\.\d+(?= )
If it could also be without a dot you could check for a leading and a trailing whitespace using lookrounds and make the part which will match a dot and one or more times a digit optional (?:\.\d+)?.
(?<= )\d+(?:\.\d+)?(?= )
Demo

Related

How to make this Regex pattern work for both strings

I have the strings 'amount $165' and 'amount on 04/20' (and a few other variations I have no issues with so far). I want to be able to run an expression and return the numerical amount IF available (in the first string it is 165) and return nothing if it is not available AND make sure not to confuse with a date (second string). If I write the code as following, it returns the 165 but it also returns 04 from the second.
amount_search = re.findall(r'amount.*?(\d+)[^\d?/]?, string)
If I write it as following, it includes neither
amount_search = re.findall(r'amount.*?(\d+)[^\d?/], string)
How to change what I have to return 165 but not 04?
To capture the whole number in a group, you could match amount followed by matching all chars except digits or newlines if the value can not cross newline boundaries.
Capture the first encountered digits in a group and assert a whitespace boundary at the right.
\bamount [^\d\r\n]*(\d+)(?!\S)
In parts
\bamount Match amount followed by a space and preceded with a word boundary
[^\d\r\n]* Match 0 or more times any char except a digit or newlines
(\d+) Capture group 1, match 1 or more digits
(?!\S) Assert a whitespace boundary on the right
Regex demo
try this ^amount\W*\$([\d]{1,})$
the $ indicate end of line, for what I have tested, use .* or ? also work.
by grouping the digits, you can eliminate the / inside the date format.
hope this helps :)
Try this:
from re import sub
your_digit_list = [int(sub(r'[^0-9]', '', s)) for s in str.split() if s.lstrip('$').isdigit()]

Get last part after number regex python

I have always 2 numbers in between and I want to extract everything before 3 so Salvatore and everything after 2 Abdulla
For example I have the following:
txt = "Salvatore32Abdulla"
first = re.findall("^\D+", txt)
last = re.search(,txt)
Expected result:
first = 'Salvatore'
last = 'Abdulla'
I can get the first part, but after 2 I can't get the last part
You could also do this in a single line by slightly changing the solution suggested by #ctwheels as follows. I would suggest you to use re.findall as that gets the job done with a single blow.
import re
txt = "Salvatore32Abdulla"
Option-1
Single line extraction of the non-numeric parts.
first, last = re.findall("\D+", txt)
print((first, last))
('Salvatore', 'Abdulla')
Option-2
If you would (for some reason) also want to keep track of the number in between:
first, num, last = re.findall("(\D+)(\d{2})(\D+)", txt)
print((first, num, last))
('Salvatore', '32', 'Abdulla')
Option-3
As an extension of Option-2 and considering the text with a form 'Salvatore####...###Abdulla', where ####...### denotes a continuous block of digits separating the non-numeric parts and you may or may not have any idea of how many digits could be in-between, you could use the following:
first, num, last = re.findall("(\D+)(\d*)(\D+)", txt)
print((first, num, last))
('Salvatore', '32', 'Abdulla')
Why am I not getting the expected results?
You currently have one issue with your regex and one with your code.
Your regex contains ^, which anchors it to the start of the string. This will only allow you to match Salvatore. You're using findall (which is the appropriate choice if you change the regex to simply \D+), but right now it's only getting one result.
The second re.search call is not needed as you can capture first and last with the findall given an appropriate pattern (see below).
How do I fix it?
See code in use here
import re
txt = "Salvatore32Abdulla"
x = re.findall("\D+", txt)
print(x)
Result:
['Salvatore', 'Abdulla']
You could use a regex like this:
txt = "Salvatore32Abdulla"
regex = r"(\D+)\d\d(\D+)"
match = re.match(regex, txt)
first = match.group(1)
last = match.group(2)
Part after last digit:
match = re.search(r'\D+$',txt)
if match:
print(match.group())
See Python proof | regex proof.
Results: Abdulla
EXPLANATION
--------------------------------------------------------------------------------
\D+ non-digits (all but 0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
$ before an optional \n, and the end of the
string

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend"\x" before every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.
Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
Code demo
You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add add it back at every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)/2)])
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
print('\n\nNew string:')
print('\\x' + '\\x'.join(match))
#for elem in match: # match gives you a list of strings with the hex values
# print('\\x{}'.format(elem), end='')
print('\n\nOriginal string:')
print(mystr)
This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
See code in use here
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x$1"
result = re.sub(regex, subst, test_str, 0, re.IGNORECASE)
if result:
print (result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring either of the following doesn't match.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexidecimal digit.
([a-f\d]{2}) Capture any two hexidecimal digits into capture group 1.

python regex: get end digits from a string

I am quite new to python and regex (regex newbie here), and I have the following simple string:
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
I would like to extract only the last digits in the above string i.e 767980716 and I was wondering how I could achieve this using python regex.
I wanted to do something similar along the lines of:
re.compile(r"""-(.*?)""").search(str(s)).group(1)
indicating that I want to find the stuff in between (.*?) which starts with a "-" and ends at the end of string - but this returns nothing..
I was wondering if anyone could point me in the right direction..
Thanks.
You can use re.match to find only the characters:
>>> import re
>>> s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
>>> re.match('.*?([0-9]+)$', s).group(1)
'767980716'
Alternatively, re.finditer works just as well:
>>> next(re.finditer(r'\d+$', s)).group(0)
'767980716'
Explanation of all regexp components:
.*? is a non-greedy match and consumes only as much as possible (a greedy match would consume everything except for the last digit).
[0-9] and \d are two different ways of capturing digits. Note that the latter also matches digits in other writing schemes, like ୪ or ൨.
Parentheses (()) make the content of the expression a group, which can be retrieved with group(1) (or 2 for the second group, 0 for the whole match).
+ means multiple entries (at least one number at the end).
$ matches only the end of the input.
Nice and simple with findall:
import re
s=r"""99-my-name-is-John-Smith-6376827-%^-1-2-767980716"""
print re.findall('^.*-([0-9]+)$',s)
>>> ['767980716']
Regex Explanation:
^ # Match the start of the string
.* # Followed by anthing
- # Upto the last hyphen
([0-9]+) # Capture the digits after the hyphen
$ # Upto the end of the string
Or more simply just match the digits followed at the end of the string '([0-9]+)$'
Your Regex should be (\d+)$.
\d+ is used to match digit (one or more)
$ is used to match at the end of string.
So, your code should be: -
>>> s = "99-my-name-is-John-Smith-6376827-%^-1-2-767980716"
>>> import re
>>> re.compile(r'(\d+)$').search(s).group(1)
'767980716'
And you don't need to use str function here, as s is already a string.
Use the below regex
\d+$
$ depicts the end of string..
\d is a digit
+ matches the preceding character 1 to many times
Save the regular expressions for something that requires more heavy lifting.
>>> def parse_last_digits(line): return line.split('-')[-1]
>>> s = parse_last_digits(r"99-my-name-is-John-Smith-6376827-%^-1-2-767980716")
>>> s
'767980716'
I have been playing around with several of these solutions, but many seem to fail if there are no numeric digits at the end of the string. The following code should work.
import re
W = input("Enter a string:")
if re.match('.*?([0-9]+)$', W)== None:
last_digits = "None"
else:
last_digits = re.match('.*?([0-9]+)$', W).group(1)
print("Last digits of "+W+" are "+last_digits)
Try using \d+$ instead. That matches one or more numeric characters followed by the end of the string.

What does "?:" mean in a Python regular expression?

Below is the Python regular expression. What does the ?: mean in it? What does the expression do overall? How does it match a MAC address such as "00:07:32:12:ac:de:ef"?
re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}), string)
It (?:...) means a set of non-capturing grouping parentheses.
Normally, when you write (...) in a regex, it 'captures' the matched material. When you use the non-capturing version, it doesn't capture.
You can get at the various parts matched by the regex using the methods in the re package after the regex matches against a particular string.
How does this regular expression match MAC address "00:07:32:12:ac:de:ef"?
That's a different question from what you initially asked. However, the regex part is:
([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})
The outer most pair of parentheses are capturing parentheses; what they surround will be available when you use the regex against a string successfully.
The [\dA-Fa-f]{2} part matches a digit (\d) or the hexadecimal digits A-Fa-f], in a pair {2}, followed by a non-capturing grouping where the matched material is a colon or dash (: or -), followed by another pair of hex digits, with the whole repeated exactly 5 times.
p = re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}))
m = p.match("00:07:32:12:ac:de:ef")
if m:
m.group(1)
The last line should print the string "00:07:32:12:ac:de" because that is the first set of 6 pairs of hex digits (out of the seven pairs in total in the string). In fact, the outer grouping parentheses are redundant and if omitted, m.group(0) would work (it works even with them). If you need to match 7 pairs, then you change the 5 into a 6. If you need to reject them, then you'd put anchors into the regex:
p = re.compile(^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$)
The caret ^ matches the start of string; the dollar $ matches the end of string. With the 5, that would not match your sample string. With 6 in place of 5, it would match your string.
Using ?: as in (?:...) makes the group non-capturing during replace. During find it does'nt make any sense.
Your RegEx means
r"""
( # Match the regular expression below and capture its match into backreference number 1
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
(?: # Match the regular expression below
[:-] # Match a single character present in the list below
# The character “:”
# The character “-”
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
){5} # Exactly 5 times
)
"""
Hope this helps.
It does not change the search process. But it affects the retrieval of the group after the match has been found.
For example:
Text:
text = 'John Wick'
pattern to find:
regex = re.compile(r'John(?:\sWick)') # here we are looking for 'John' and also for a group (space + Wick). the ?: makes this group unretrievable.
When we print the match - nothing changes:
<re.Match object; span=(0, 9), match='John Wick'>
But if you try to manually address the group with (?:) syntax:
res = regex.finditer(text)
for i in res:
print(i)
print(i.group(1)) # here we are trying to retrieve (?:\sWick) group
it gives us an error:
IndexError: no such group
Also, look:
Python docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
the link to the re page in docs:
https://docs.python.org/3/library/re.html
(?:...) means a non cature group. The group will not be captured.

Categories