Python regular expression retrieving numbers between two different delimiters - python

I have the following string
"h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
I would like to use regular expressions to extract the groups:
group1 56,7,1
group2 88,9,1
group3 58,8,1
group4 45
group5 100
group6 null
My ultimate goal is to have tuples such as (group1, group2), (group3, group4), (group5, group6). I am not sure if this all can be accomplished with regular expressions.
I have the following regular expression, which gives me partial results:
(?<=h=|d=)(.*?)(?=h=|d=)
The matches have an extra comma at the end, like 56,7,1, which I would like to remove, and d=, is not returning a null.

You likely do not need regex at all. A list comprehension with .split() can do what you need:
Code:
def split_it(a_string):
    if not a_string.endswith(','):
        a_string += ','
    return [x.split(',')[:-1] for x in a_string.split('=') if len(x)][1:]
Test Code:
tests = (
    "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,",
    "h=56,7,1,d=88,9,1,d=,h=58,8,1,d=45,h=100",
)
for test in tests:
    print(split_it(test))
Results:
[['56', '7', '1'], ['88', '9', '1'], ['58', '8', '1'], ['45'], ['100'], ['']]
[['56', '7', '1'], ['88', '9', '1'], [''], ['58', '8', '1'], ['45'], ['100']]

You could match rather than split using the expression
[dh]=([\d,]*),
and grab the first group, see a demo on regex101.com.
That is
[dh]=      # d or h, followed by =
([\d,]*)   # capture digits and commas, 0+ times
,          # require a comma afterwards
In Python:
import re
rx = re.compile(r'[dh]=([\d,]*),')
string = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
numbers = [m.group(1) for m in rx.finditer(string)]
print(numbers)
Which yields
['56,7,1', '88,9,1', '58,8,1', '45', '100', '']
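To get from this flat list to the (h, d) tuples the question asks for, one possible sketch pairs up consecutive matches. It assumes that h= and d= entries strictly alternate, as in the sample string, and it maps an empty match to None. The names values and pairs are introduced here for illustration only.

```python
import re

rx = re.compile(r'[dh]=([\d,]*),')
string = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"

# An empty capture (as for "d=,") becomes None.
values = [m.group(1) or None for m in rx.finditer(string)]

# Assumption: entries alternate h, d, h, d, ... so even indexes are h
# values and odd indexes are the matching d values.
pairs = list(zip(values[0::2], values[1::2]))
print(pairs)
# [('56,7,1', '88,9,1'), ('58,8,1', '45'), ('100', None)]
```

If the input can contain two h= or two d= entries in a row, this pairing would be wrong and the letter would need to be captured as well.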

You can use ([a-z]=)([0-9,]+)(,)? and then pick the group you need by its index.

You could use $ in positive lookahead to match against the end of the string:
import re
input_str = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
groups = []
for x in re.findall('(?<=h=|d=)(.*?)(?=d=|h=|$)', input_str):
    m = x.strip(',')
    if m:
        groups.append(m.split(','))
    else:
        groups.append(None)
print(groups)
Output:
[['56', '7', '1'], ['88', '9', '1'], ['58', '8', '1'], ['45'], ['100'], None]

Here, I have assumed that the parameters only have numerical values. If so, you can try this (each match keeps its trailing comma, so strip it afterwards):
(?<=h=|d=)([0-9,]*)
Hope it helps.
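As a rough sketch of how that pattern behaves on the sample string: the character class is greedy, so every match runs up to and including the comma before the next key. The rstrip step and the variable names raw and cleaned are additions for illustration, not part of the answer above.

```python
import re

input_str = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"

# Each match starts right after "h=" or "d=" and greedily takes digits and commas.
raw = re.findall(r'(?<=h=|d=)([0-9,]*)', input_str)
print(raw)
# ['56,7,1,', '88,9,1,', '58,8,1,', '45,', '100,', ',']

# Drop the trailing comma; the empty "d=," entry becomes ''.
cleaned = [m.rstrip(',') for m in raw]
print(cleaned)
# ['56,7,1', '88,9,1', '58,8,1', '45', '100', '']
```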

Related

Regex .search with grouping is not collecting groups

I am trying to search through the following list
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
using this code:
next_page = re.compile(r'/(\d+)_p/$')
matches = list(filter(next_page.search, href_search)) #search or .match
for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())
and am getting the following error: AttributeError: 'str' object has no attribute 'group'.
I thought that the parentheses around the \d+ would group the one or more numbers. My goal is to get the number preceding "_p/" at the end of the string.
You are filtering your original list, so what is being returned are the original strings, not the match objects. If you want to return the match objects, you need to map the search to the list, then filter the match objects. For example:
next_page = re.compile(r'/(\d+)_p/$')
matches = filter(lambda m: m is not None, map(next_page.search, href_search))
for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())
Output:
/2_p/
/3_p/
/6_p/
/7_p/
/8_p/
/2_p/
If you only want the number part of the match, use match.group(1) instead of match.group().
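The same map-then-filter idea can be written as a single comprehension that reads capture group 1 directly. This is only a sketch: the shortened hrefs below are hypothetical stand-ins for the list in the question, and page_numbers is a name chosen here.

```python
import re

next_page = re.compile(r'/(\d+)_p/$')

# Hypothetical, shortened hrefs standing in for the list in the question.
href_search = [
    '/for_sale/9_zm/2_p/',
    '/for_sale/9_zm/3_p/',
    '/for_sale/about/',  # no page suffix, so search() returns None
]

# Keep the match objects, drop the misses, and read capture group 1.
page_numbers = [m.group(1) for m in map(next_page.search, href_search) if m]
print(page_numbers)  # ['2', '3']
```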
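I think re.findall should do the trick, assuming href_search is a single multi-line string rather than a list and the pattern is compiled with re.M so that $ matches at the end of every line:
re.compile(r'/(\d+)_p/$', re.M).findall(href_search) # ['2', '3', '6', '7', '8', '2']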
Alternatively, you could split the lines and then search them individually:
matches = []
for line in href_search.splitlines():
    match = next_page.search(line)
    if match:
        matches.append(match.group(1))
matches # ['2', '3', '6', '7', '8', '2']
You can try this:
import re
# add re.M to match the end of each line
next_page = re.compile(r'/(\d+)_p/$', re.M)
matches = next_page.findall(href_search)
print(matches)
It gives:
['2', '3', '6', '7', '8', '2']
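To see what re.M changes, here is a small comparison on a multi-line string. The three-line sample text is a shortened, hypothetical version of the data in the question.

```python
import re

# Hypothetical shortened input: one href per line in a single string.
href_search = "/a/9_zm/2_p/\n/a/9_zm/3_p/\n/a/9_zm/6_p/"

# Without re.M, $ only matches at the very end of the string.
without_flag = re.findall(r'/(\d+)_p/$', href_search)
print(without_flag)  # ['6']

# With re.M, $ also matches at the end of every line.
with_flag = re.findall(r'/(\d+)_p/$', href_search, re.M)
print(with_flag)  # ['2', '3', '6']
```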
The filter function will only remove the lines that don't match the regex and will return the string, eg:
>>> example = ["abc", "def", "ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> list(filter(my_match.search, example))
['123']
If you want the match object then a list comprehension could do the trick:
>>> example = ["abc", "def45", "67ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> [my_match.search(line) for line in example] # Get the matches
[None,
<re.Match object; span=(3, 5), match='45'>,
None,
<re.Match object; span=(0, 3), match='123'>]
>>> [match.group() for match in [my_match.search(line) for line in example] if match is not None] # Filter None values
['45', '123']
You can do regex (?<=\/)\d+(?=\_p\/$). See regex101 as example
Explanation:
(?<=\/) : Look behind for /
\d+ : Look for one or more digits
(?=\_p\/$) : Look ahead for _p/ at the end of string
If there is a match, then return only \d+ value.
You can either write the code to grab all the data at once or iterate through them line by line and get the data you need.
Below is the code for both:
text_line = '''/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/'''
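import re
# Line by line: $ anchors the match at the end of each individual line
for txt in text_line.split('\n'):
    t = re.findall(r'(?<=\/)\d+(?=\_p\/$)', txt)
    print(t)
# All at once: no $, so every /<digits>_p/ in the text is matched
t = re.findall(r'(?<=\/)\d+(?=\_p\/)', text_line)
print(t)
The first part works line by line; the second grabs everything in one shot.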
Output of both are:
Line by line:
['2']
['3']
['6']
['7']
['8']
['2']
Grab all at once:
['2', '3', '6', '7', '8', '2']
For the second one, I left out the $ anchor, since without re.M it would only match at the end of the whole string and we need every occurrence.

How can I use regex to match only one character in Python?

I am trying to process a list of files
file_list = ['.DS_Store', '9', '7', '6', '8', '01', '4', '3', '2', '5']
the goal is to find the files whose name has only one character.
I tried this code
r = re.compile('[0-9]')
result_list = list(filter(r.match, file_list))
result_list
and got
['9', '7', '6', '8', '01', '4', '3', '2', '5']
where '01' should not be included.
I made a workaround
tmp = []
for i in file_list:
    if len(i)==1:
        tmp.append(i)
tmp
and I got
['9', '7', '6', '8', '4', '3', '2', '5']
This is exactly what I want, although the method is ugly.
How can I use regex in Python to finish the task?
r = re.compile('^[0-9]$')
The ^ matches the beginning of the string and $ matches the end (they match at line boundaries only when re.MULTILINE is set).
And if you really want it to match any character, not just numbers, it should be
r = re.compile('^.$')
The . in the regex is a single-character wildcard.
Match a string if it's simply any single character appearing at the beginning of the string (^.) right before the end of the string ($):
^.$
Your Python then becomes:
r = re.compile('^.$')
result_list = list(filter(r.match, file_list))
Your code is equivalent to
[ i for i in file_list if len(i)==1]
And this method adapts to every case in which file's name has only one character.
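Another option, offered here as a sketch rather than as part of the answers above, is re.fullmatch, which requires the pattern to cover the entire string so that no ^ or $ anchors are needed. The name single_digit is chosen for illustration.

```python
import re

file_list = ['.DS_Store', '9', '7', '6', '8', '01', '4', '3', '2', '5']

# fullmatch() succeeds only if the whole name is exactly one digit.
single_digit = re.compile(r'[0-9]')
result_list = [name for name in file_list if single_digit.fullmatch(name)]
print(result_list)
# ['9', '7', '6', '8', '4', '3', '2', '5']

# '$' also matches just before a trailing newline; fullmatch() does not.
print(bool(re.match(r'^.$', 'a\n')))    # True
print(bool(re.fullmatch(r'.', 'a\n')))  # False
```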

Splitting a string similar to ip addresses using regex in Python

I want to have a regular expression which will split on seeing a '.'(dot)
For example:
Input: '1.2.3.4.5.6'
Output : ['1', '2', '3', '4', '5', '6']
What I have tried:-
>>> pattern = r'(\d+)(\.(\d+))+'
>>> test = '192.168.7.6'
>>> re.findall(pattern, test)
What I get:-
[('192', '.6', '6')]
What I expect from re.findall():-
[('192', '168', '7', '6')]
Could you please help in pointing what is wrong?
My thinking -
In pattern = '(\d+)(\.(\d+))+', the initial (\d+) will find the first number, i.e. 192. Then (\.(\d+))+ will find one or more occurrences of the form '.<number>', i.e. .168 and .7 and .6
[EDIT:]
This is a simplified version of the problem I am solving.
In reality, the input can be-
192.168 dot 7 {dot} 6
and expected output is still [('192', '168', '7', '6')].
Once I figure out the solution to extract .168, .7, .6 like patterns, I can then extend it to dot 168, {dot} 7 like patterns.
Since you only need to find the numbers, the regex \d+ should be enough to find numbers separated by any other token/separator:
re.findall(r"\d+", test)
This should work on any of those cases:
>>> re.findall(r"\d+", "192.168.7.6")
['192', '168', '7', '6']
>>> re.findall(r"\d+", "192.168 dot 7 {dot} 6 | 125 ; 1")
['192', '168', '7', '6', '125', '1']
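The original pattern returns only the last octet because a repeated capturing group keeps just its final repetition. If the separators should be matched explicitly instead of grabbing every digit run, a possible sketch uses re.split. The separator set (".", "dot" and "{dot}", with optional surrounding whitespace) is assumed from the examples in the question, and split_octets is a hypothetical helper name.

```python
import re

# Separators assumed from the question: ".", "dot" and "{dot}",
# each optionally surrounded by whitespace.
separator = re.compile(r'\s*(?:\.|\{dot\}|dot)\s*')

def split_octets(text):
    """Split text on any of the assumed separators."""
    return separator.split(text)

print(split_octets('192.168.7.6'))            # ['192', '168', '7', '6']
print(split_octets('192.168 dot 7 {dot} 6'))  # ['192', '168', '7', '6']

# For comparison: the repeated group in the original pattern keeps
# only its last repetition.
print(re.findall(r'(\d+)(\.(\d+))+', '192.168.7.6'))  # [('192', '.6', '6')]
```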

How to remove whitespace in a list

I can't remove my whitespace in my list.
invoer = "5-9-7-1-7-8-3-2-4-8-7-9"
cijferlijst = []
for cijfer in invoer:
    cijferlijst.append(cijfer.strip('-'))
I tried the following, but it doesn't work. I already made a list from my string and separated everything, but each "-" is now a "".
filter(lambda x: x.strip(), cijferlijst)
filter(str.strip, cijferlijst)
filter(None, cijferlijst)
abc = [x.replace(' ', '') for x in cijferlijst]
Try this:
>>> ''.join(invoer.split('-'))
'597178324879'
If you want the numbers in string without -, use .replace() as:
>>> string_list = "5-9-7-1-7-8-3-2-4-8-7-9"
>>> string_list.replace('-', '')
'597178324879'
If you want the numbers as list of numbers, use .split():
>>> string_list.split('-')
['5', '9', '7', '1', '7', '8', '3', '2', '4', '8', '7', '9']
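For context, the loop in the question walks the string one character at a time, so every "-" is stripped down to an empty string. In Python 3, filter returns a lazy iterator, and neither it nor a list comprehension changes the list in place, so the result has to be assigned to a name. A minimal sketch, using per_character and as_ints as illustrative names:

```python
invoer = "5-9-7-1-7-8-3-2-4-8-7-9"

# What the question's loop builds: each '-' turns into ''.
per_character = [c.strip('-') for c in invoer]
print(per_character[:4])  # ['5', '', '9', '']

# Dropping the empty strings and keeping the result.
cijferlijst = [c for c in per_character if c]
print(cijferlijst)

# Splitting on '-' avoids the empty strings in the first place.
as_ints = [int(c) for c in invoer.split('-')]
print(as_ints)  # [5, 9, 7, 1, 7, 8, 3, 2, 4, 8, 7, 9]
```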
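This looks a lot like the following question:
Python: Removing spaces from list objects
The answer there is to use strip instead of replace. Have you tried the following?
abc = [x.strip(' ') for x in cijferlijst]
Note that this only trims spaces; the empty strings stay in the list unless you filter them out as well.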

python regex extraction of fields using re.compile

import re
array = ['gmond 10-22:13:29', 'bash 12-25:13:59']
regex = re.compile(r"((\d+)\-)?((\d+):)?(\d+):(\d+)$")
for key in array:
    res = regex.match(key)
    if res:
        print(res.group(2))
        print(res.group(5))
        print(res.group(6))
I know I am doing it wrong, but I tried several things and failed. Can someone help me fetch the pattern matches using group, or suggest a better way? I want to fetch the digits if the pattern is matched. This works smoothly with re.search, but I have to do it using re.compile in this case. I appreciate your help.
You can use re.findall if you are sure of the format of the elements of array:
>>> import re
>>> array = ["10-22:13:29", "12-25:13:59"]
>>> regex = re.compile(r"\d+")
>>> for key in array:
...     res = regex.findall(key)
...     if res:
...         print(res)
...
['10', '22', '13', '29']
['12', '25', '13', '59']
You can use search with compile just as well. (match matches only at the beginning of the string.)
You are catching - and :, also, you have redundant brackets. Here's the code with modified regex:
import re
array = ["10-22:13:29", "12-25:13:59"]
regex = re.compile(r"^(\d+)\-?(\d+):?(\d+):?(\d+)$")
for key in array:
    res = regex.match(key)
    if res:
        print(res.groups())
prints:
('10', '22', '13', '29')
('12', '25', '13', '59')
See, all digits are extracted properly.
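If the process names from the question ("gmond", "bash") have to stay in the input, a compiled pattern can still be used with its search method, which scans for the pattern anywhere in the string. The sketch below keeps the optional day and hour fields from the question's pattern but makes the wrapper groups non-capturing; the list name results is chosen here for illustration.

```python
import re

array = ['gmond 10-22:13:29', 'bash 12-25:13:59']

# (?:...) groups without capturing, so groups() returns only the digits.
regex = re.compile(r"(?:(\d+)-)?(?:(\d+):)?(\d+):(\d+)$")

results = []
for key in array:
    res = regex.search(key)  # search() is not anchored to the start
    if res:
        results.append(res.groups())

print(results)
# [('10', '22', '13', '29'), ('12', '25', '13', '59')]
```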
