I'm trying to extract tokens, or parts of tokens, that consist of numeric/alphanumeric characters and are at least 8 characters long from the text.
Example:
text = 'https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8'
The expected output would be :
59800512 510557XXXXXX2302 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg 69i57j0i22i30l8j0i390 4672j0j7
I have tried using the regular expression : ((\d+)|([A-Za-z]+\d)[\dA-Za-z]*) based on the answer Python Alphanumeric Regex. I got the following results :
[match for match in re.findall(r"((\d+)|([A-Za-z]+\d)[\dA-Za-z]*)",text)]
Output :
[('59800512', '59800512', ''),
('510557', '510557', ''),
('XXXXXX2302', '', 'XXXXXX2'),
('1601371803', '1601371803', ''),
('NhLw6NlR0EksRWkLddEo7NiEvrg', '', 'NhLw6'),
('69', '69', ''),
('i57j0i22i30l8j0i390', '', 'i5'),
('4672', '4672', ''),
('j0j7', '', 'j0'),
('8', '8', '')]
I'm getting a tuple of matching groups for each matching token.
It is possible to filter these tuples again. But I'm trying to make the code as efficient and pythonic as possible.
Could anyone suggest a solution? It need not be based on regular expressions.
Thanks in advance
Edit :
I expect alphanumeric values of length equal to or greater than 8
You get the tuples in the result, as re.findall returns the values of the capture groups.
But you can omit the capture groups and change the pattern to a single match, matching at least a digit between chars A-Z a-z and assert a minimum of 8 characters using a positive lookahead.
\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b
\b A word boundary
(?=[A-Za-z0-9]{8}) Positive lookahead, assert at least 8 occurrences of any of the listed ranges
[A-Za-z]* Optionally match chars A-Z a-z
\d Match a digit
[A-Za-z\d]* Optionally match chars A-Z a-z or digits
\b A word boundary
See a regex demo or a Python demo.
import re
from pprint import pprint
pattern = r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b"
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
pprint(re.findall(pattern, s))
Output
['59800512',
'510557XXXXXX2302',
'1601371803',
'NhLw6NlR0EksRWkLddEo7NiEvrg',
'69i57j0i22i30l8j0i390',
'4672j0j7']
I came up with:
\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b
See an online demo
\b - Word boundary.
[A-Za-z]{,7} - 0-7 times an alphabetic char.
\d - A single digit.
[A-Za-z\d]{7,} - 7+ times an alphanumeric char.
\b - Word boundary.
Some sample code:
import re
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
result = re.findall(r'\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b', s)
print(result)
Prints:
['59800512', '510557XXXXXX2302', '1601371803', 'NhLw6NlR0EksRWkLddEo7NiEvrg', '69i57j0i22i30l8j0i390', '4672j0j7']
You could opt to match case-insensitive with:
(?i)\b[a-z]{,7}\d[a-z\d]{7,}\b
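The inline (?i) can also be written as the re.IGNORECASE flag. A minimal sketch, assuming a made-up sample string (it is not from the question) that mixes letter cases:

```python
import re

# Hypothetical sample text, chosen only to exercise both letter cases.
sample = "abc12345678 AB1CDEFGHI x1"

# Inline flag, as in the pattern above.
inline = re.findall(r"(?i)\b[a-z]{,7}\d[a-z\d]{7,}\b", sample)

# Same pattern with the flag passed as an argument instead.
flagged = re.findall(r"\b[a-z]{,7}\d[a-z\d]{7,}\b", sample, flags=re.IGNORECASE)

print(inline)             # ['abc12345678', 'AB1CDEFGHI']
print(inline == flagged)  # True
```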
Although the selected answer returns the required output, it is not generic and fails to match some cases (e.g., s = "thisword2H2g2d").
For a more generic regex that works for all combinations of alphanumeric values:
result = re.findall(r"(\d+[A-Za-z\d]+\d*)|([A-Za-z]+[\d]+[A-Za-z\d]*)", s)
See the demo here.
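The thread does not say which answer was the selected one, so as a quick check, here is a sketch that runs both earlier patterns on that input. The variable names are illustrative only.

```python
import re

# The two patterns proposed in the earlier answers.
lookahead_pattern = re.compile(r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b")
bounded_pattern = re.compile(r"\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b")

sample = "thisword2H2g2d"

# The lookahead version places no limit on the letters before the first digit.
print(lookahead_pattern.findall(sample))  # ['thisword2H2g2d']

# The bounded version allows at most 7 letters before the first digit,
# and "thisword" has 8, so nothing matches.
print(bounded_pattern.findall(sample))    # []
```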
Example first:
import re
details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'
input_re = re.compile(r'(?!output[0-9]*) mem([0-9a-f]+)')
print(input_re.findall(details))
# Out: ['001', '005', '002', '006']
I am using negative lookahead to extract the hex part of the mem entries that are not preceded by an output, however as you can see it fails. The desired output should be: ['001', '002'].
What am I missing?
You may use this regex in findall:
\b(?!output\d+)\w+\s+mem([a-fA-F\d]+)
RegEx Demo
RegEx Details:
\b: Word boundary
(?!output\d+): Negative lookahead to assert that we don't have output and 1+ digits ahead
\w+: Match 1+ word characters
\s+: Match 1+ whitespaces
mem([a-fA-F\d]+): Match mem followed by 1+ of any hex character
Code:
import re
s = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'
print( re.findall(r'\b(?!output\d+)\w+\s+mem([a-fA-F\d]+)', s) )
Output:
['001', '002']
Maybe an easier approach is to split it up into two regular expressions?
First filter out anything that starts with output and is followed by mem like so
output[0-9]* mem([0-9a-f]+)
If you filter this out it would result in
input1 mem001 data2 mem002
When you have filtered them out just search for mem again
mem([0-9a-f]+)
That would result in your desired output
['001', '002']
Maybe not an answer to the original question, but it is a solution to your problem
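The two steps above could be sketched roughly as follows, assuming re.sub is used for the filtering step (the answer does not say how to filter):

```python
import re

details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'

# Step 1: drop every "outputN memXXX" pair from the text.
without_outputs = re.sub(r'output[0-9]* mem[0-9a-f]+', '', details)

# Step 2: whatever mem entries remain are the ones we want.
result = re.findall(r'mem([0-9a-f]+)', without_outputs)

print(result)  # ['001', '002']
```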
First of all, let's understand why your original regex doesn't work:
A regex encapsulates two pieces of information: a description of a location within a text, and a description of what to capture from that location. Your original regex tells the regex matcher: "Find a location within the text where the following characters are not 'output'+digits but they are ' mem'+alphanumerics". Think of the logic of that expression: if the matcher finds a location in the text where the following characters are ' mem'+alphanumerics, then, in particular, the following characters are not 'output'+digits. Your look-ahead does not add anything to the expression.
What you really need is to tell the matcher: "Find a location in the text where the following characters are ' mem'+alphanumerics, and the previous characters are not 'output'+digits". So what you really need is a look-behind, not a look-ahead.
@ArtyomVancyan proposed a good regex with a look-behind, and it could easily be modified to what you need: instead of a single digit after the 'output', you want potentially more digits, so just put an asterisk (*) after the '\d'. Note, however, that Python's built-in re module only accepts fixed-width look-behinds, so a variable-width look-behind like that needs the third-party regex module or a different formulation.
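One way to express the "previous characters are not 'output'+digits" condition without any look-behind is to capture the preceding token and filter on it. A minimal sketch, which is not the approach of any answer here:

```python
import re

details = 'input1 mem001 output1 mem005 data2 mem002 output12 mem006'

# Capture each "<name> mem<hex>" pair: group 1 is the preceding token,
# group 2 is the hex part of the mem entry.
pairs = re.findall(r'(\w+)\s+mem([0-9a-f]+)', details)

# Keep only the hex parts whose preceding token is not "output" + digits.
result = [hex_part for name, hex_part in pairs
          if not re.fullmatch(r'output\d*', name)]

print(result)  # ['001', '002']
```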
I don't understand why it only gives 125, the first number. Why does it not give all the positive numbers in that string? My goal is to extract all positive numbers.
import re
pattern = re.compile(r"^[+]?\d+")
text = "125 -898 8969 4788 -2 158 -947 599"
matches = pattern.finditer(text)
for match in matches:
    print(match)
Try using the regular expression
-\d+|(\d+)
Disregard the matches. The strings representing non-negative integers are saved in capture group 1.
Demo
The idea is to match but not save to a capture group what you don't want (negative numbers), and both match and save to a capture group what you do want (non-negative numbers).
The regex attempts to match -\d+. If that succeeds the regex engine's internal string pointer is moved to just after the last digit matched. If -\d+ is not matched an attempt is made to match the second part of the alternation (following |). If \d+ is matched the match is saved to capture group 1.
Any plus signs in the string can be disregarded.
For a fuller description of this technique see The Greatest Regex Trick Ever. (Search for "Tarzan"|(Tarzan) to get to the punch line.)
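In Python this technique could look roughly like the sketch below. re.findall returns an empty string for capture group 1 whenever the negative branch matched, so those entries are filtered out.

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# Negative numbers are matched but not captured; non-negative ones are captured.
raw = re.findall(r"-\d+|(\d+)", text)
print(raw)  # ['125', '', '8969', '4788', '', '158', '', '599']

# Drop the empty strings left behind by the "-\d+" branch.
non_negative = [m for m in raw if m]
print(non_negative)  # ['125', '8969', '4788', '158', '599']
```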
The following pattern will only match non negative numbers:
pattern = re.compile(r"(?:^|[^\-\d])(\d+)")
pattern.findall(text)
OUTPUT
['125', '8969', '4788', '158', '599']
For the sake of completeness another idea by use of \b and a lookbehind.
\b(?<!-)\d+
See this demo at regex101
Your pattern ^[+]?\d+ is anchored at the start of the string, and will give only that match at the beginning.
Another option is to assert a whitespace boundary to the left, and match the optional + followed by 1 or more digits.
(?<!\S)\+?\d+\b
(?<!\S) Assert a whitespace boundary to the left
\+? Match an optional +
\d+\b Match 1 or more digits followed by a word boundary
Regex demo
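A short sketch applying this pattern to the text from the question:

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# No ^ anchor, so the search is not limited to the start of the string.
# (?<!\S) rejects any position that directly follows a non-whitespace
# character, which rules out the digits after a minus sign.
pattern = re.compile(r"(?<!\S)\+?\d+\b")

positives = [m.group() for m in pattern.finditer(text)]
print(positives)  # ['125', '8969', '4788', '158', '599']
```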
Use , to separate the numbers in the string.
Python 3.8.2
The task at hand is simple: to match lowercase characters separated by a single underscore. So the pattern could be r"[a-z]+_[a-z]+"
Now my issue is that I expected re.findall() to pair up all the following:
"ash_tonic_transit_so_kern_err_looo_"
Instead of pairing all the words around each underscore ('ash_tonic', 'tonic_transit', 'transit_so', etc.), I get three pairs: ['ash_tonic', 'transit_so', 'kern_err']
Does Python's re omit part of the string once a match has been found instead of running the search again?
import re
def match_lower(s):
    patternRegex = re.compile(r'[a-z]+_[a-z]+')
    mo = patternRegex.findall(s)
    return mo

print(match_lower('ash_tonic_transit_so_kern_err_looo_'))
You could use a positive lookahead with a capturing group to get the matches, and start the match asserting what is directly to the left is not a char a-z using a negative lookbehind.
Use re.findall which will return the values from the capturing group.
(?<![a-z])(?=([a-z]+_[a-z]+))
Explanation
(?<![a-z]) Negative lookbehind, assert what is directly to the left is not a char a-z
(?= Positive lookahead, assert what on the right is
([a-z]+_[a-z]+) Capture group 1, match 1+ chars a-z _ 1+ chars a-z
) Close lookahead
Regex demo | Python demo
import re
regex = r"(?<![a-z])(?=([a-z]+_[a-z]+))"
test_str = "ash_tonic_transit_so_kern_err_looo_"
print(re.findall(regex, test_str))
Output
['ash_tonic', 'tonic_transit', 'transit_so', 'so_kern', 'kern_err', 'err_looo']
This is explicitly mentioned in the documentation of re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings.
For instance, 'ash_tonic' and 'tonic_transit' overlap, so they won't be considered two distinct matches.
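To see the non-overlapping behaviour, the match spans can be printed with re.finditer. A small sketch: each match starts at or after the end of the previous one, so 'tonic' is already consumed by the time the second search begins.

```python
import re

s = "ash_tonic_transit_so_kern_err_looo_"

# Collect (span, text) for every match of the original pattern.
matches = [(m.span(), m.group()) for m in re.finditer(r"[a-z]+_[a-z]+", s)]

for span, text in matches:
    print(span, text)
# (0, 9) ash_tonic
# (10, 20) transit_so
# (21, 29) kern_err
```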
I have a string s = '10000',
Using only Python's re.findall, I need to count how many times 0\d0 occurs in the string s
For example: for the string s = '10000' it should return 2
explanation:
the first occurrence is 1[000]0 (indices 1-3), while the second occurrence is 10[000] (indices 2-4)
I just need how many occurrences and not interested in the occurrence patterns
I've tried the following regex statements:
re.findall(r'(0\d0)', s) #output: ['000']
re.findall(r'(0\d0)*', s) #output: ['', '', '000', '', '', '']
Finally, how can I make this regex generic, so that it matches any digit, then any digit (possibly the same one), then the first digit again?
How to get all possible occurrences?
The regex
As I mentioned in my comment, you can use the following pattern:
(?=(0\d0))
How it works:
(?=...) is a positive lookahead ensuring what follows matches. This doesn't consume characters (allowing us to check for a match at each position in the string as a regex would otherwise resume pattern matching after the consumed characters).
(0\d0) is a capture group matching 0, then any digit, then 0
The code
Your code becomes:
See code in use here
re.findall(r'(?=(0\d0))', s)
The result is:
['000', '000']
The python re.findall method states the following
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
This means that our matches are the results of capture group 1 rather than the full match as many would expect.
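Since the question only asks for the number of occurrences, the list can simply be counted. A minimal sketch, where count_overlapping is a hypothetical helper name and not part of the re module:

```python
import re

def count_overlapping(pattern: str, text: str) -> int:
    """Count matches of `pattern` in `text`, including overlapping ones.

    Hypothetical helper: wraps the pattern in a lookahead with a capture
    group so that no characters are consumed between matches.
    """
    return len(re.findall("(?=(" + pattern + "))", text))

s = '10000'
count = count_overlapping(r"0\d0", s)
print(count)  # 2
```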
How to generalize the pattern?
The regex
You can use the following pattern:
(\d)\d\1
How this works:
(\d) captures any digit into capture group 1
\d matches any digit
\1 is a backreference that matches the same text as most recently matched by capture group 1
The code
Your code becomes:
See code in use here
x = re.findall(r'(?=((\d)\d\2))', s)
print([n[0] for n in x])
Note: The code above has two capture groups, so we need to change the backreference to \2 to match correctly. Since we now have two capture groups, we will get tuples as the documentation states and can use list comprehension to get the expected results.
The result is:
['000', '000']
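As a variation on the same idea, re.finditer with group(1) avoids having to unpack tuples. The second input string below is made up for illustration:

```python
import re

# Lookahead keeps matches overlapping; group 1 is the whole triple,
# group 2 is the first digit that must repeat in third position.
pattern = re.compile(r'(?=((\d)\d\2))')

def triples(text):
    # Return every digit triple whose first and last digits are equal.
    return [m.group(1) for m in pattern.finditer(text)]

print(triples('10000'))    # ['000', '000']
print(triples('1213141'))  # ['121', '131', '141']
```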
I'd like to match numbers, positive or negative, possibly with a currency sign in front. But I don't want something like PSM-9. My code is:
test='AAA PCSK-9, $111 -3,33'
re.findall(r'\b-?[$€£]?-?\d+[\d,.]*\b', test)
Output is:['-9', '111', '3,33']
Could someone explain why -9 is matched? Thank you in advance.
Edit:
I don't want any part of PCSK-9 to be matched; it is the name of a product rather than a number. So my desired output is:
['111', '3,33']
This is because \b matches the gap between K and -, a word and a non-word character. If you want to avoid matching - if it's preceded by a word you can use negative lookbehind instead:
re.findall(r'[$€£]?(?:(?<!\w)-)?\d+[\d,.]*\b', test)
With your sample input, this returns:
['9', '$111', '-3,33']
Demo: https://regex101.com/r/A66C5W/1
The word boundary matches between the K and the dash. The 2 parts after the dash, [$€£]?-?, are optional because of the question mark, and then you match one or more digits. This results in the match -9.
What you might use instead of a word boundary is an assertion that checks if what is before and after the match is not a non whitespace character \S using a negative lookbehind and a negative lookahead.
(?<!\S)-?[$€£]?(\d+(?:[,.]\d+)?)(?!\S)
Regex demo | Python demo
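A short sketch running this pattern on the sample from the question. Because the number itself sits in capture group 1, re.findall returns it without the sign or currency symbol.

```python
import re

test = 'AAA PCSK-9, $111 -3,33'

# (?<!\S) and (?!\S) require whitespace (or the string edge) on both sides,
# so the 9 in "PCSK-9," is rejected on the left by "-" and "K".
pattern = re.compile(r'(?<!\S)-?[$€£]?(\d+(?:[,.]\d+)?)(?!\S)')

numbers = pattern.findall(test)
print(numbers)  # ['111', '3,33']
```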
-9 is matched because - is a non-word character and K is a word character, so between them there is a word boundary \b, which is what your regexp asks for.