Regular expression to extract number with hyphen - python

The text is like "1-2years. 3years. 10years."
I want to get the result [(1,2),(3),(10)].
I use python.
I first tried r"([0-9]?)[-]?([0-9])years". It works well except for the case of 10. I also tried r"([0-9]?)[-]?([0-9]|10)years" but the result is still [(1,2),(3),(1,0)].

Your attempt r"([0-9]?)[-]?([0-9])years" doesn't work for the case of 10 because you ask it to match one (or zero) digit per group.
You also don't need the hyphen in brackets.
This should work: Regex101
(\d+)(?:-(\d+))?years
Explanation:
(\d+): Capturing group for one or more digits
(?: ): Non-capturing group
- : hyphen
(\d+): Capturing group for one or more digits
(?: )?: Make the previous non-capturing group optional
In python:
import re
result = re.findall(r"(\d+)(?:-(\d+))?years", "1-2years. 3years. 10years.")
# Gives: [('1', '2'), ('3', ''), ('10', '')]
Each tuple in the list contains two elements: The number on the left side of the hyphen, and the number on the right side of the hyphen. Removing the blank elements is quite easy: you loop over each item in result, then you loop over each match in this item and only select it (and convert it to int) if it is not empty.
final_result = [tuple(int(match) for match in item if match) for item in result]
# gives: [(1, 2), (3,), (10,)]
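To make the difference concrete, here is a minimal side-by-side sketch of the original attempt and the corrected pattern on the sample text from the question. The variable names `original` and `corrected` are only illustrative.

```python
import re

text = "1-2years. 3years. 10years."

# Original attempt: each group can match at most one digit,
# so "10" is split across the two groups.
original = re.findall(r"([0-9]?)[-]?([0-9])years", text)

# Corrected pattern: \d+ lets a group match a multi-digit number,
# and the hyphen plus second number are optional as a unit.
corrected = re.findall(r"(\d+)(?:-(\d+))?years", text)

print(original)   # [('1', '2'), ('', '3'), ('1', '0')]
print(corrected)  # [('1', '2'), ('3', ''), ('10', '')]
```

Note that the original pattern also puts the lone "3" in the second group rather than the first, because the first group is optional and gives its digit back when the second group needs one.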

This should work:
import re
st = '1-2years. 3years. 10years.'
result = [tuple(e for e in tup if e)
          for tup in re.findall(r'(?:(\d+)-(\d+)|(\d+))years', st)]
# [('1', '2'), ('3',), ('10',)]
The regex will look for either one number, or two separated by a hyphen, immediately prior to the word years. If we give this to re.findall(), it will give us the output [('1', '2', ''), ('', '', '3'), ('', '', '10')], so we also use a quick list comprehension to filter out the empty strings.
Alternately we could use r'(\d+)(?:-(\d+))?years' to basically the same effect, which is closer to what you've already tried.

You can use this pattern: (?:(\d+)-)?(\d+)years
See Regex Demo
Code:
import re
pattern = r"(?:(\d+)-)?(\d+)years"
text = "1-2years. 3years. 10years."
print([tuple(int(z) for z in x if z) for x in re.findall(pattern, text)])
Output:
[(1, 2), (3,), (10,)]

You only match a single digit as the character class [0-9] is not repeated.
Another option is to match the first digits with an optional part for - and digits.
Then you can split the matches on -
\b(\d+(?:-\d+)?)years\.
\b A word boundary
( Capture group 1 (which will be returned by re.findall)
\d+(?:-\d+)? Match 1+ digits and optionally match - and again 1+ digits
) Close group 1
years\. Match literally with the escaped .
See a regex demo and a Python demo.
Example
import re
pattern = r"\b(\d+(?:-\d+)?)years\."
s = "1-2years. 3years. 10years."
res = [tuple(v.split('-')) for v in re.findall(pattern, s)]
print(res)
Output
[('1', '2'), ('3',), ('10',)]
Or if a list of lists is also ok instead of tuples
res = [v.split('-') for v in re.findall(pattern, s)]
Output
[['1', '2'], ['3'], ['10']]
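If integers are wanted, as in the question's expected result, the split approach can be extended with an int conversion. This is a small sketch under the same pattern; the name `ranges` is just illustrative.

```python
import re

pattern = r"\b(\d+(?:-\d+)?)years\."
s = "1-2years. 3years. 10years."

# Each match is a string such as "1-2" or "10"; split on the hyphen
# and convert every piece to int.
ranges = [tuple(int(n) for n in v.split("-")) for v in re.findall(pattern, s)]
print(ranges)  # [(1, 2), (3,), (10,)]
```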

Related

Python recognize part of string (position and length)

I have a file (.VAR) that gives me positions and lengths within strings, one per row; see the example below.
*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE
How do I retrieve the " xxLxxx:" value, which is always preceded by a space and always ends with a colon, but is never at the same location within the string?
Preferably I would like to find the number before L as the position, and the number behind L as the length, but only searching for "L" would give me also the input from other values within the string. Therefore I think I have to use the space_number_L_number_colon to recognize this part, but I don't know how.
Any thoughts? TIA
You can use a regex here.
Example:
s='''*STRING1 1L8:StringONE
*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO
*STRINGWITHLENGTH3 *ABC 33L2:StringTHREE'''
import re
out = re.findall(r'\s(\d+)L(\d+):', s)
output: [('1', '8'), ('29', '4'), ('33', '2')]
As integers:
out = [tuple(map(int, x)) for x in re.findall(r'\s(\d+)L(\d+):', s)]
output: [(1, 8), (29, 4), (33, 2)]
regex:
regex demo
\s # space
(\d+) # capture one or more digits
L # literal L
(\d+) # capture one or more digits
: # literal :
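When the file is processed one row at a time, the same pattern can be wrapped in a small helper. The sketch below uses named groups so the position and length are easy to tell apart; the function name `parse_var_line` and the group names are assumptions made for illustration, not part of the original answer.

```python
import re

# Same pattern as above, with named groups for readability.
field_pattern = re.compile(r"\s(?P<position>\d+)L(?P<length>\d+):")

def parse_var_line(line):
    """Return (position, length) as ints, or None if the line has no match."""
    match = field_pattern.search(line)
    if match is None:
        return None
    return int(match.group("position")), int(match.group("length"))

print(parse_var_line("*STRING1 1L8:StringONE"))                          # (1, 8)
print(parse_var_line("*STRINGWITHVARIABLELENGTH2 *ABC 29L4:StringTWO"))  # (29, 4)
```

Returning None for non-matching rows lets the caller decide whether to skip or report them.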

Detecting alphanumeric/numeric values in python string

I'm trying to extract tokens/part of tokens that have numeric/alphanumeric characters that have a length greater than 8 from the text.
Example:
text = 'https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8'
The expected output would be :
59800512 510557XXXXXX2302 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg 69i57j0i22i30l8j0i390 4672j0j7
I have tried using the regular expression : ((\d+)|([A-Za-z]+\d)[\dA-Za-z]*) based on the answer Python Alphanumeric Regex. I got the following results :
[match for match in re.findall(r"((\d+)|([A-Za-z]+\d)[\dA-Za-z]*)",text)]
Output :
[('59800512', '59800512', ''),
('510557', '510557', ''),
('XXXXXX2302', '', 'XXXXXX2'),
('1601371803', '1601371803', ''),
('NhLw6NlR0EksRWkLddEo7NiEvrg', '', 'NhLw6'),
('69', '69', ''),
('i57j0i22i30l8j0i390', '', 'i5'),
('4672', '4672', ''),
('j0j7', '', 'j0'),
('8', '8', '')]
I'm getting a tuple of matching groups for each matching token.
It is possible to filter these tuples again. But I'm trying to make the code as efficient and pythonic as possible.
Could anyone suggest a solution? It need not be based on regular expressions.
Thanks in advance
Edit :
I expect alphanumeric values of length equal to or greater than 8
You get the tuples in the result, as re.findall returns the values of the capture groups.
But you can omit the capture groups and change the pattern to a single match, matching at least a digit between chars A-Z a-z and assert a minimum of 8 characters using a positive lookahead.
\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b
\b A word boundary
(?=[A-Za-z0-9]{8}) Positive lookahead, assert at least 8 occurrences of any of the listed ranges
[A-Za-z]* Optionally match a char A-Z a-z
\d Match a digit
[A-Za-z\d]* Optionally match chars A-Z a-z or digits
\b A word boundary
See a regex demo or a Python demo.
import re
from pprint import pprint
pattern = r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b"
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
pprint(re.findall(pattern, s))
Output
['59800512',
'510557XXXXXX2302',
'1601371803',
'NhLw6NlR0EksRWkLddEo7NiEvrg',
'69i57j0i22i30l8j0i390',
'4672j0j7']
I came up with:
\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b
See an online demo
\b - Word boundary.
[A-Za-z]{,7} - 0-7 alphabetic chars.
\d - A single digit.
[A-Za-z\d]{7,} - 7+ times an alphanumeric char.
\b - Word boundary.
Some sample code:
import re
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
result = re.findall(r'\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b', s)
print(result)
Prints:
['59800512', '510557XXXXXX2302', '1601371803', 'NhLw6NlR0EksRWkLddEo7NiEvrg', '69i57j0i22i30l8j0i390', '4672j0j7']
You could opt to match case-insensitive with:
(?i)\b[a-z]{,7}\d[a-z\d]{7,}\b
Although the selected answer returns the required output, it is not generic, and it fails to match specific cases (e.g., s = "thisword2H2g2d").
For a more generic regex that works for all combinations of alphanumeric values:
result = re.findall(r"(\d+[A-Za-z\d]+\d*)|([A-Za-z]+[\d]+[A-Za-z\d]*)", s)
See the demo here.
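Since the question says the solution need not rely on a single regular expression, here is a minimal alternative sketch: split the text on anything that is not an ASCII letter or digit, then keep tokens that are at least 8 characters long and contain a digit. The names `tokens` and `MIN_LENGTH` are illustrative. One assumption to be aware of: unlike the `\b`-based patterns, this treats an underscore as a separator.

```python
import re

text = ("https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 "
        "Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg "
        "https://www.google.com/search?q=some+google+search&oq=some+google+search"
        "&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8")

MIN_LENGTH = 8  # the question asks for length >= 8

# Split into runs of ASCII letters/digits, then filter by length and
# by the presence of at least one digit.
tokens = [
    tok
    for tok in re.split(r"[^A-Za-z0-9]+", text)
    if len(tok) >= MIN_LENGTH and any(ch.isdigit() for ch in tok)
]
print(tokens)
# ['59800512', '510557XXXXXX2302', '1601371803',
#  'NhLw6NlR0EksRWkLddEo7NiEvrg', '69i57j0i22i30l8j0i390', '4672j0j7']
```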

python regex how to capture all groups that may be optional?

minutes_pattern = re.compile(
r"""
(?:(?P<hours>\d+)H)?
(?:(?P<minutes>\d+)M)?
(?:(?P<seconds>\d+)S)?
""",
re.VERBOSE,
)
I'm trying to capture all HOURS, MINUTES, SECONDS.
However, these may or may not exist:
Examples:
PT30H20M10S
PT20M10S
PT20M
PT3010S
PT30H
I'm thinking (?:(?P<hours>\d+)H)? says:
?: non capturing group
?P hours group name
\d+ any number of digits
H must exist
)? this group may or may not exist
But this compiled pattern does not capture any of the 3 groups. Help please
regex = minutes_pattern.search(duration)
print(regex.groups())
This only returns None
(None, None, None)
Ideally this would show (for the 1st example):
group hours: 30
group minutes: 20
group seconds: 10
And for the 2nd example:
group hours: None
group minutes: 20
group seconds: 10
The problem is that your regex successfully matches an empty string, so when you use .search to search for the first match, the first match occurs at index 0 with a length of 0:
>>> minutes_pattern.search(duration)
<_sre.SRE_Match object; span=(0, 0), match=''>
The simplest way to fix this would be to include the letters PT at the start of your regex, since these are always present in your input string, and they prevent the regex returning an empty match:
>>> minutes_pattern = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")
>>> minutes_pattern.search(duration)
<_sre.SRE_Match object; span=(0, 11), match='PT30H20M10S'>
>>> minutes_pattern.search(duration).groups()
('30', '20', '10')
Alternatively, if you don't want to include PT in your regex (perhaps some other inputs won't include that prefix), you could use .findall instead of .search, and then filter the list to find the "non-trivial" match:
>>> minutes_pattern = re.compile(r"(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")
>>> minutes_pattern.findall(duration)
[('', '', ''), ('', '', ''), ('30', '20', '10'), ('', '', '')]
>>> next(filter(any, minutes_pattern.findall(duration)), None)
('30', '20', '10')
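The question used named groups, so here is a sketch that keeps them, anchors the pattern on the PT prefix, and returns a dictionary. It assumes every input starts with PT, as in the examples; the function name `parse_duration` is hypothetical. `fullmatch` is used so that trailing garbage is rejected rather than silently ignored.

```python
import re

duration_pattern = re.compile(
    r"""
    PT                        # literal prefix, prevents an empty match
    (?:(?P<hours>\d+)H)?      # optional hours
    (?:(?P<minutes>\d+)M)?    # optional minutes
    (?:(?P<seconds>\d+)S)?    # optional seconds
    """,
    re.VERBOSE,
)

def parse_duration(duration):
    """Return a dict of ints (None for missing parts), or None if no match."""
    match = duration_pattern.fullmatch(duration)
    if match is None:
        return None
    return {
        name: int(value) if value is not None else None
        for name, value in match.groupdict().items()
    }

print(parse_duration("PT30H20M10S"))  # {'hours': 30, 'minutes': 20, 'seconds': 10}
print(parse_duration("PT20M10S"))     # {'hours': None, 'minutes': 20, 'seconds': 10}
```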

Why is my Regular Expression not matching?

I have the following pattern:
find_pattern = re.compile(r'(ga:country:\s)([a-zA-Z()\s]*)(.*users:\s)(\d+),')
This is what the input that should be matched looks like:
ga:country: (not set),Date range:0,ga:users:60,
ga:country: Albania,Date range:0,ga:users:7,
ga:country: Algeria,Date range:0,ga:users:10,
...
ga:country: Argentina,Date range:0,ga:users:61,
ga:country: Armenia,Date range:0,ga:users:2,
And this is how the output is going to be formatted (in case it adds any value to the question):
['(not set)', 60],
['Albania', 7],
And when I run a test:
matches = find_pattern.finditer(self.data)
print('matches:', matches)
for match in matches:
    print(match)
No matches are found.
Hope someone is able to help.
I would suggest using 2 capturing groups instead of 4, adding optional whitespace chars after ga:, and making the whitespace chars optional after users:
The .* could also be non-greedy, .*?, to get the first one in case there are more users: parts.
To prevent users: being part of a larger word, you could make it more specific by matching :users:
\bga:\s*country:\s*([a-zA-Z()\s]*),.*?:users:(\d+)
Regex demo
Example with re.findall that returns the values of the capturing groups:
import re
regex = r"\bga:\s*country:\s*([a-zA-Z()\s]*),.*?:users:(\d+)"
s = ("ga:country: (not set),Date range:0,ga:users:60,\n"
"ga:country: Albania,Date range:0,ga:users:7,\n"
"ga:country: Algeria,Date range:0,ga:users:10,\n"
"ga:country: Argentina,Date range:0,ga:users:61,\n"
"ga:country: Armenia,Date range:0,ga:users:2,")
print(re.findall(regex, s))
Output
[('(not set)', '60'), ('Albania', '7'), ('Algeria', '10'), ('Argentina', '61'), ('Armenia', '2')]
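To get the output format shown in the question (a list of [country, users] pairs with the count as an integer), the tuples can be post-processed. A minimal sketch, using a two-line sample of the input; the name `rows` is illustrative.

```python
import re

regex = r"\bga:\s*country:\s*([a-zA-Z()\s]*),.*?:users:(\d+)"
s = ("ga:country: (not set),Date range:0,ga:users:60,\n"
     "ga:country: Albania,Date range:0,ga:users:7,")

# Convert each (country, users) tuple into [country, int(users)].
rows = [[country, int(users)] for country, users in re.findall(regex, s)]
print(rows)  # [['(not set)', 60], ['Albania', 7]]
```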

How can I get the number of groups to vary depending on the number of lines?

I have this regex: ^:([^:]+):([^:]*) which works as in this regex101 link.
Now, in Python, I have this:
def get_data():
    data = read_mt_file()
    match_fields = re.compile('^:([^:]+):([^:]*)', re.MULTILINE)
    fields = re.findall(match_fields, data)
    return fields
Which, for a file containing the data from regex101, returns:
[('1', 'text\ntext\n\n'), ('20', 'text\n\n'), ('21', 'text\ntext\ntext\n\n'), ('22', ' \n\n'), ('25', 'aa\naa\naaaaa')]
Now, this is ok, but I want to change the regex, so that I can get the number of groups to vary depending on the number of lines. Meaning:
for the first line, now, I get two groups:
1
text\ntext\n\n
I'd like to get instead:
1
((text\n), (text\n\n)) <-- those should somehow be in the same group but separated, each in its own subgroup. I need to know that they both belong to one field but are separate lines.
So, In python, the desired result for that file would be:
[('1', '(text\n), (text\n\n)'), ('20', 'text\n\n'), ('21', '(text\n), (text\n), (text\n\n)'), ('22', ' \n\n'), ('25', '(aa\n), (aa\n), (aaaaa)')]
Is this possible with regex? Could this be achieved with some nice string manipulation instead ?
To do what you want, you'd need another regex.
This is because a repeated capturing group only keeps the last item it matched:
>>> re.match(r'(\d)+', '12345').groups()
('5',)
Instead of using one regex you'll need to use two.
The one that you are using at the moment, and then one to match all the 'sub-groups', using say re.findall.
You can get these sub-groups by simply matching anything that isn't a \n and then any amount of \n.
So you could use a regex such as [^\n]+\n*:
>>> re.findall(r'[^\n]+\n*', 'text\ntext')
['text\n', 'text']
>>> re.findall(r'[^\n]+\n*', 'text\ntext\n\n')
['text\n', 'text\n\n']
>>> re.findall(r'[^\n]+\n*', '')
[]
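Putting the two regexes together, a sketch of the two-pass approach might look like the following. It uses the field pattern from the question and the line pattern above; the function name `split_fields` and the choice of a nested tuple for the sub-groups are assumptions made for illustration.

```python
import re

# First pass: the field regex from the question.
field_pattern = re.compile(r"^:([^:]+):([^:]*)", re.MULTILINE)
# Second pass: one line plus any trailing newlines.
line_pattern = re.compile(r"[^\n]+\n*")

def split_fields(data):
    """Return (tag, (line1, line2, ...)) for every field in data."""
    return [
        (tag, tuple(line_pattern.findall(body)))
        for tag, body in field_pattern.findall(data)
    ]

sample = ":1:text\ntext\n\n:20:text\n\n:25:aa\naa\naaaaa"
print(split_fields(sample))
# [('1', ('text\n', 'text\n\n')), ('20', ('text\n\n',)), ('25', ('aa\n', 'aa\n', 'aaaaa'))]
```

Keeping the lines in a nested tuple preserves which field each line belongs to while still exposing the lines individually.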
You may use a simple trick: after getting the matches with your regex, run a .+\n* regex over the Group 2 value:
import re
p = re.compile(r'^:([^:]+):([^:]+)', re.MULTILINE)
s = ":1:text\ntext\n\n:20:text\n\n:21:text\ntext\ntext\n\n:22: \n\n:25:aa\naa\naaaaa"
print([[x.group(1)] + re.findall(r".+\n*", x.group(2)) for x in p.finditer(s)])
Here,
p.finditer(s) finds all matches in the string using your regex
[x.group(1)] - a list created from the first group contents
re.findall(r".+\n*", x.group(2)) - fetches individual lines from Group 2 contents (with trailing newlines, 0 or more)
[] + re.findall - combining the lists into 1.
Result is
[['1', 'text\n', 'text\n\n'], ['20', 'text\n\n'], ['21', 'text\n', 'text\n', 'text\n\n'], ['22', ' \n\n'], ['25', 'aa\n', 'aa\n', 'aaaaa']]
Another approach: match all the substrings with your pattern and then use a re.sub to add ), ( between the lines ending with optional newlines:
[(x, "({})".format(re.sub(r".+(?!\n*$)\n+", r"\g<0>), (", y))) for x, y in p.findall(s)]
Result:
[('1', '(text\n), (text\n\n)'), ('20', '(text\n\n)'), ('21', '(text\n), (text\n), (text\n\n)'), ('22', '( \n\n)'), ('25', '(aa\n), (aa\n), (aaaaa)')]
See the Python 3 demo
Here:
p.findall(s) - grabs all the matches in the form of a list of tuples containing your capture group contents using your regex
(x, "({})".format(re.sub(r".+(?!\n*$)\n+", r"\g<0>), (", y))) - creates a tuple from Group 1 contents and Group 2 contents that are a bit modified with the re.sub the way described below
.+(?!\n*$)\n+ - pattern that matches 1+ characters other than newline and then 1+ newline symbols if they are not at the end of the string. If they are at the end of the string, there will be no replacement made (to avoid , () at the end). The \g<0> in the replacement string is re-inserting the whole match back into the resulting string and appends ), ( to it.
