Regex .search with grouping is not collecting groups - python

I am trying to search through the following list
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
using this code:
next_page = re.compile(r'/(\d+)_p/$')
matches = list(filter(next_page.search, href_search)) #search or .match
for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())
and am getting the following error: AttributeError: 'str' object has no attribute 'group'.
I thought that the parentheses around the \d+ would group the one or more digits. My goal is to get the number preceding "_p/" at the end of the string.

You are filtering your original list, so what is being returned are the original strings, not the match objects. If you want to return the match objects, you need to map the search to the list, then filter the match objects. For example:
next_page = re.compile(r'/(\d+)_p/$')
matches = filter(lambda m: m is not None, map(next_page.search, href_search))
for match in matches:
    #refining_nextpage = re.compile()
    print(match.group())
Output:
/2_p/
/3_p/
/6_p/
/7_p/
/8_p/
/2_p/
If you only want the number part of the match, use match.group(1) instead of match.group().
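As a minimal sketch of the point above, assuming href_search is a list of strings (the three shortened hrefs below are made up for illustration and are not the asker's data), keeping the match objects and then reading group(1) yields only the page numbers:

```python
import re

# Hypothetical, shortened hrefs standing in for the list in the question.
href_search = ["/for_sale/9_zm/2_p/", "/for_sale/9_zm/3_p/", "/for_sale/9_zm/"]

next_page = re.compile(r'/(\d+)_p/$')

# Keep the match objects (not the original strings) and drop the non-matches.
matches = [m for m in map(next_page.search, href_search) if m is not None]

# group() is the whole match; group(1) is only the captured digits.
whole = [m.group() for m in matches]
pages = [m.group(1) for m in matches]
print(whole)
print(pages)
```

The third href has no page suffix, so search returns None for it and it is filtered out.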

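I think re.findall should do the trick, assuming href_search is a single multi-line string and next_page is compiled with re.M so that $ matches at the end of every line: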
next_page.findall(href_search) # ['2', '3', '6', '7', '8', '2']
Alternatively, you could split the lines and then search them individually:
matches = []
for line in href_search.splitlines():
    match = next_page.search(line)
    if match:
        matches.append(match.group(1))
matches # ['2', '3', '6', '7', '8', '2']

You can try this:
import re
# add re.M to match the end of each line
next_page = re.compile(r'/(\d+)_p/$', re.M)
matches = next_page.findall(href_search)
print(matches)
It gives:
['2', '3', '6', '7', '8', '2']
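To see what the re.M flag changes, here is a small sketch on a two-line string (the shortened hrefs are invented for the example). Without the flag, $ only matches at the end of the whole string, so only the last line is found:

```python
import re

# Two shortened, made-up hrefs joined into one multi-line string.
text = "/for_sale/9_zm/2_p/\n/for_sale/9_zm/3_p/"

pattern = r'/(\d+)_p/$'

# Without re.M, $ matches only at the very end of the string.
without_flag = re.findall(pattern, text)

# With re.M, $ also matches just before each newline.
with_flag = re.findall(pattern, text, re.M)

print(without_flag)
print(with_flag)
```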

The filter function only removes the items that don't match the regex; it returns the original strings, not match objects, e.g.:
>>> example = ["abc", "def", "ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> list(filter(my_match.search, example))
['123']
If you want the match object then a list comprehension could do the trick:
>>> example = ["abc", "def45", "67ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> [my_match.search(line) for line in example] # Get the matches
[None,
<re.Match object; span=(3, 5), match='45'>,
None,
<re.Match object; span=(0, 3), match='123'>]
>>> [match.group() for match in [my_match.search(line) for line in example] if match is not None] # Filter None values
['45', '123']

You can use the regex (?<=\/)\d+(?=\_p\/$).
Explanation:
(?<=\/) : Look behind for /
\d+ : Look for one or more digits
(?=\_p\/$) : Look ahead for _p/ at the end of string
If there is a match, then return only \d+ value.
You can either write the code to grab all the data at once or iterate through them line by line and get the data you need.
Below is the code for both:
text_line = '''/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/'''
import re
# line by line
for txt in text_line.split('\n'):
    t = re.findall(r'(?<=\/)\d+(?=\_p\/$)', txt)
    print(t)
# all at once
t = re.findall(r'(?<=\/)\d+(?=\_p\/)', text_line)
print(t)
The first part processes the text line by line; the second grabs everything in one call.
Output of both are:
Line by line:
['2']
['3']
['6']
['7']
['8']
['2']
Grab all at once:
['2', '3', '6', '7', '8', '2']
For the second one, I left out the $ because, without the re.M flag, it only matches at the end of the whole text, and we need the number from every line.

Related

How to properly group items in a string?

I currently have a group of strings that look like this:
[58729 58708]
[58729]
[58708]
[58729]
I need to turn them into a list, but when I use list(), I get:
['[', '5', '8', '7', '2', '9', ']']
['[', '5', '8', '7', '0', '8', ']']
['[', '5', '8', '7', '2', '9', ']']
['[', '5', '8', '7', '2', '9', ' ', '5', '8', '7', '0', '8', ']']
How do I group them so that they don't get separated out into individual characters? So, something like this:
['58729', '58708']
['58729']
['58708']
['58729']
Let's say your input string is assigned to a variable foo.
foo = '[58729 58708]'
First, you want to use list slicing to get rid of the brackets at the start and end of the string:
foo = foo[1:-1]
Now, you can just use the string method split() to turn the string into a list. Here, the input of split() is the character at which the list shall be split. In your case, that would be a single space character:
foo.split(' ')
This returns
['58729', '58708'].
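A short sketch applying the same slice-and-split steps to every string from the question. The names lines and groups are just illustrative, and split() is called without an argument here so that any run of whitespace acts as the separator:

```python
# The four strings from the question, assumed to be collected in a list.
lines = ['[58729 58708]', '[58729]', '[58708]', '[58729]']

# Drop the surrounding brackets, then split on whitespace.
groups = [line[1:-1].split() for line in lines]

for group in groups:
    print(group)
```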
You can use regex to extract the values between the square brackets, then split the values into a list.
The code:
import re
s = '[58729 58708]'
result = re.search(r'\[(.*)\]', s).group(1).split()
print(result)
print(type(result))
The result:
['58729', '58708']
<class 'list'>
Imo the royal path would be to combine a regex with a small parser:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import re
data = """
[58729 58708]
[58729]
[58708]
[58729]
"""
# outer expression
rx = re.compile(r'\[[^\[\]]+\]')
# nodevisitor class
class StringVisitor(NodeVisitor):
    grammar = Grammar(
        r"""
        list = lpar content+ rpar
        content = item ws?
        item = ~"[^\[\]\s]+"
        ws = ~"\s+"
        lpar = "["
        rpar = "]"
        """
    )

    def generic_visit(self, node, visited_children):
        return visited_children or node

    def visit_content(self, node, visited_children):
        item, _ = visited_children
        return item.text

    def visit_list(self, node, visited_children):
        _, content, _ = visited_children
        return [item for item in content]

sv = StringVisitor()
for lst in rx.finditer(data):
    real_list = sv.parse(lst.group(0))
    print(real_list)
Which would yield
['58729', '58708']
['58729']
['58708']
['58729']
Example with "ast" module usage
import ast
data_str = '[58729 58708]'
data_str = data_str.replace(' ',',') # make it '[58729, 58708]'
x = ast.literal_eval(data_str)
print(x)
Out[1]:
[58729, 58708]
print(x[0])
Out[2]:
58729
print(type(x))
Out[3]:
<class 'list'>
# and after all if you want exactly list of string:
[str(s) for s in x]
Out[4]:
['58729', '58708']

Python regular expression retrieving numbers between two different delimiters

I have the following string
"h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
I would like to use regular expressions to extract the groups:
group1 56,7,1
group2 88,9,1
group3 58,8,1
group4 45
group5 100
group6 null
My ultimate goal is to have tuples such as (group1, group2), (group3, group4), (group5, group6). I am not sure if this all can be accomplished with regular expressions.
I have the following regular expression, which gives me partial results:
(?<=h=|d=)(.*?)(?=h=|d=)
The matches have an extra comma at the end like 56,7,1, which I would like to remove and d=, is not returning a null.
You likely do not need to use regex. A list comprehension and .split() can likely do what you need like:
Code:
def split_it(a_string):
    if not a_string.endswith(','):
        a_string += ','
    return [x.split(',')[:-1] for x in a_string.split('=') if len(x)][1:]
Test Code:
tests = (
"h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,",
"h=56,7,1,d=88,9,1,d=,h=58,8,1,d=45,h=100",
)
for test in tests:
    print(split_it(test))
Results:
[['56', '7', '1'], ['88', '9', '1'], ['58', '8', '1'], ['45'], ['100'], ['']]
[['56', '7', '1'], ['88', '9', '1'], [''], ['58', '8', '1'], ['45'], ['100']]
You could match rather than split using the expression
[dh]=([\d,]*),
and grab the first group, see a demo on regex101.com.
That is
[dh]= # d or h, followed by =
([\d,]*) # capture digits and commas, 0+ times
, # require a comma afterwards
In Python:
import re
rx = re.compile(r'[dh]=([\d,]*),')
string = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
numbers = [m.group(1) for m in rx.finditer(string)]
print(numbers)
Which yields
['56,7,1', '88,9,1', '58,8,1', '45', '100', '']
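To get from that list to the tuples the question asks for, one possible sketch is to pair consecutive items. This assumes the h= and d= fields strictly alternate, as they do in the sample string; the names values and pairs, and the choice of None for an empty field, are illustrative only:

```python
import re

rx = re.compile(r'[dh]=([\d,]*),')
string = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
numbers = [m.group(1) for m in rx.finditer(string)]

# Represent an empty field as None (an assumption about what "null" should be).
values = [n or None for n in numbers]

# zip over the same iterator twice to pair items (0, 1), (2, 3), (4, 5).
it = iter(values)
pairs = list(zip(it, it))
print(pairs)
```

If the number of fields were odd, zip would silently drop the last unpaired item.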
You can use ([a-z]=)([0-9,]+)(,)?
You just need to select the group you want by its index.
You could use $ in positive lookahead to match against the end of the string:
import re
input_str = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
groups = []
for x in re.findall('(?<=h=|d=)(.*?)(?=d=|h=|$)', input_str):
    m = x.strip(',')
    if m:
        groups.append(m.split(','))
    else:
        groups.append(None)
print(groups)
Output:
[['56', '7', '1'], ['88', '9', '1'], ['58', '8', '1'], ['45'], ['100'], None]
Here, I have assumed that parameters will only have numerical values. If it is so, then you can try this.
(?<=h=|d=)([0-9,]*)
Hope it helps.

How to remove whitespace in a list

I can't remove my whitespace in my list.
invoer = "5-9-7-1-7-8-3-2-4-8-7-9"
cijferlijst = []
for cijfer in invoer:
    cijferlijst.append(cijfer.strip('-'))
I tried the following but it doesn't work. I already made a list from my string and separated everything, but each "-" is now a "".
filter(lambda x: x.strip(), cijferlijst)
filter(str.strip, cijferlijst)
filter(None, cijferlijst)
abc = [x.replace(' ', '') for x in cijferlijst]
Try that:
>>> ''.join(invoer.split('-'))
'597178324879'
If you want the numbers in string without -, use .replace() as:
>>> string_list = "5-9-7-1-7-8-3-2-4-8-7-9"
>>> string_list.replace('-', '')
'597178324879'
If you want the numbers as list of numbers, use .split():
>>> string_list.split('-')
['5', '9', '7', '1', '7', '8', '3', '2', '4', '8', '7', '9']
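A small sketch of why the loop in the question produces empty strings, and how split avoids it. Iterating over a string yields one character at a time, so every '-' is stripped down to ''. The names per_character and numbers are made up for this example:

```python
invoer = "5-9-7-1-7-8-3-2-4-8-7-9"

# What the question's loop builds: each '-' character becomes ''.
per_character = [cijfer.strip('-') for cijfer in invoer]

# Splitting on '-' never creates those empty entries in the first place.
cijferlijst = invoer.split('-')

# Optional: convert the digit strings to integers.
numbers = [int(c) for c in cijferlijst]

print(per_character[:4])
print(cijferlijst)
print(numbers)
```

Filtering the empty strings out of per_character (for example with a list comprehension and `if c`) would give the same list as split.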
This looks a lot like the following question:
Python: Removing spaces from list objects
The answer there is to use strip instead of replace. Have you tried
abc = [x.strip(' ') for x in cijferlijst]

Find first x matches with re.findall

I need to limit re.findall to the first 3 matches and then stop.
For example:
text = 'some1 text2 bla3 regex4 python5'
re.findall(r'\d',text)
then I get:
['1', '2', '3', '4', '5']
and I want:
['1', '2', '3']
re.findall returns a list, so the simplest solution would be to just use slicing:
>>> import re
>>> text = 'some1 text2 bla3 regex4 python5'
>>> re.findall(r'\d', text)[:3] # Get the first 3 items
['1', '2', '3']
>>>
To find N matches and stop, you could use re.finditer and itertools.islice:
>>> import itertools as IT
>>> [item.group() for item in IT.islice(re.finditer(r'\d', text), 3)]
['1', '2', '3']

python regex extraction of fields using re.compile

import re
array = ['gmond 10-22:13:29', 'bash 12-25:13:59']
regex = re.compile(r"((\d+)\-)?((\d+):)?(\d+):(\d+)$")
for key in array:
    res = regex.match(key)
    if res:
        print(res.group(2))
        print(res.group(5))
        print(res.group(6))
I know I am doing it wrong, but I tried several things and failed. Can someone help me fetch the pattern matches using group, or any better way? I want to fetch the digits if the pattern is matched. This works smoothly with re.search, but I have to do it using re.compile in this case. I appreciate your help.
You can use re.findall if you are sure of the format of the elements of array:
>>> import re
>>> array = ["10-22:13:29", "12-25:13:59"]
>>> regex = re.compile(r"\d+")
>>> for key in array:
...     res = regex.findall(key)
...     if res:
...         print(res)
...
['10', '22', '13', '29']
['12', '25', '13', '59']
You can use search with compile just as well. (match matches only at the beginning of the string.)
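A minimal sketch of that suggestion, reusing the pattern and array from the question and written for Python 3. A compiled pattern has its own search method, so nothing else needs to change; the fields list is only there to collect the results for illustration:

```python
import re

array = ['gmond 10-22:13:29', 'bash 12-25:13:59']
regex = re.compile(r"((\d+)\-)?((\d+):)?(\d+):(\d+)$")

fields = []
for key in array:
    # match() would fail here because each string starts with a name, not a digit.
    res = regex.search(key)
    if res:
        # Groups 2, 4, 5 and 6 are the inner digit groups of the pattern.
        fields.append((res.group(2), res.group(4), res.group(5), res.group(6)))

print(fields)
```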
You are capturing the - and : separators as well, and some of your parentheses are redundant. Here's the code with a modified regex:
import re
array = ["10-22:13:29", "12-25:13:59"]
regex = re.compile(r"^(\d+)\-?(\d+):?(\d+):?(\d+)$")
for key in array:
    res = regex.match(key)
    if res:
        print(res.groups())
prints:
('10', '22', '13', '29')
('12', '25', '13', '59')
See, all digits are extracted properly.
