Python regex extraction of fields using re.compile

import re

array = ['gmond 10-22:13:29', 'bash 12-25:13:59']
regex = re.compile(r"((\d+)\-)?((\d+):)?(\d+):(\d+)$")
for key in array:
    res = regex.match(key)
    if res:
        print(res.group(2))
        print(res.group(5))
        print(res.group(6))
I know I am doing it wrong, but I tried several things and failed. Can someone help me fetch the pattern matches using group, or in any better way? I want to fetch the digits if the pattern is matched. This works smoothly with re.search, but I have to do it using re.compile in this case. I appreciate your help.

You can use re.findall if you are sure of the format of the elements of array:
>>> import re
>>> array = ["10-22:13:29", "12-25:13:59"]
>>> regex = re.compile(r"\d+")
>>> for key in array:
...     res = regex.findall(key)
...     if res:
...         print(res)
...
['10', '22', '13', '29']
['12', '25', '13', '59']

You can use search with compile just as well: a compiled pattern has its own search method. (match matches only at the beginning of the string.)
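To illustrate, here is a minimal sketch that reuses the pattern and sample array from the question unchanged and only swaps match for search on the compiled object. The list name fields is an illustrative choice, not something from the original post.

```python
import re

array = ['gmond 10-22:13:29', 'bash 12-25:13:59']
regex = re.compile(r"((\d+)\-)?((\d+):)?(\d+):(\d+)$")

fields = []  # hypothetical name, used only to collect the results
for key in array:
    # search() scans the whole string; match() would fail on the leading name
    res = regex.search(key)
    if res:
        # groups 2, 4, 5 and 6 hold the digits; groups 1 and 3 include the separators
        fields.append((res.group(2), res.group(4), res.group(5), res.group(6)))

print(fields)
```

With the sample data, regex.match(key) returns None for both entries because each one starts with a process name rather than a digit.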

You are capturing the - and : separators as well, and you have redundant brackets. Here's the code with a modified regex:
import re

array = ["10-22:13:29", "12-25:13:59"]
regex = re.compile(r"^(\d+)\-?(\d+):?(\d+):?(\d+)$")
for key in array:
    res = regex.match(key)
    if res:
        print(res.groups())
This prints:
('10', '22', '13', '29')
('12', '25', '13', '59')
See, all digits are extracted properly.
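Because the separators in that pattern are optional, it would also accept a plain run of digits such as "1234". A stricter variant might look like the sketch below; it assumes every entry has exactly the form N-N:N:N, and the group names (days, hours, minutes, seconds) are guesses for illustration, since the question never says what the fields mean.

```python
import re

# Group names are assumptions for illustration; the question does not name the fields.
regex = re.compile(
    r"^(?P<days>\d+)-(?P<hours>\d+):(?P<minutes>\d+):(?P<seconds>\d+)$"
)

array = ["10-22:13:29", "12-25:13:59"]
parsed = []
for key in array:
    res = regex.match(key)
    if res:
        # groupdict() maps each group name to the text it matched
        parsed.append(res.groupdict())

print(parsed)
```

Named groups avoid having to count parentheses when the pattern changes later.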

Related

Regex .search with grouping is not collecting groups

I am trying to search through the following list
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
using this code:
next_page = re.compile(r'/(\d+)_p/$')
matches = list(filter(next_page.search, href_search))  # search or .match
for match in matches:
    # refining_nextpage = re.compile()
    print(match.group())
and I am getting the following error: AttributeError: 'str' object has no attribute 'group'.
I thought that the parentheses around the \d+ would group the one or more digits. My goal is to get the number preceding "_p/" at the end of the string.

You are filtering your original list, so what is being returned are the original strings, not the match objects. If you want to return the match objects, you need to map the search to the list, then filter the match objects. For example:
next_page = re.compile(r'/(\d+)_p/$')
matches = filter(lambda m: m is not None, map(next_page.search, href_search))
for match in matches:
    # refining_nextpage = re.compile()
    print(match.group())
Output:
/2_p/
/3_p/
/6_p/
/7_p/
/8_p/
/2_p/
If you only want the number part of the match, use match.group(1) instead of match.group().
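As a minimal sketch of that last point, the loop below keeps the match objects only long enough to read group(1) and converts the result to an integer. The shortened hrefs and the name page_numbers are stand-ins invented for the example, not the data from the question.

```python
import re

# Shortened stand-ins for the hrefs in the question (hypothetical data)
href_search = [
    "/for_sale/9_zm/2_p/",
    "/for_sale/9_zm/3_p/",
    "/for_sale/no_page/",
]
next_page = re.compile(r'/(\d+)_p/$')

page_numbers = []
for href in href_search:
    m = next_page.search(href)
    if m is not None:
        # group(1) is only the digits; group() would be the whole "/2_p/" match
        page_numbers.append(int(m.group(1)))

print(page_numbers)  # [2, 3]
```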

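I think re.findall should do the trick if href_search is a single multi-line string. Compile the pattern with re.M so that $ matches at the end of every line; without it, only the last line matches:
next_page = re.compile(r'/(\d+)_p/$', re.M)
next_page.findall(href_search) # ['2', '3', '6', '7', '8', '2']
Alternatively, you could split the lines and then search them individually:
matches = []
for line in href_search.splitlines():
    match = next_page.search(line)
    if match:
        matches.append(match.group(1))
matches # ['2', '3', '6', '7', '8', '2']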

You can try this:
import re
# add re.M to match the end of each line
next_page = re.compile(r'/(\d+)_p/$', re.M)
matches = next_page.findall(href_search)
print(matches)
It gives:
['2', '3', '6', '7', '8', '2']

The filter function only removes the lines that don't match the regex; it returns the original strings, not match objects, e.g.:
>>> example = ["abc", "def", "ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> list(filter(my_match.search, example))
['123']
If you want the match object then a list comprehension could do the trick:
>>> example = ["abc", "def45", "67ghi", "123"]
>>> my_match = re.compile(r"\d+$")
>>> [my_match.search(line) for line in example] # Get the matches
[None,
<re.Match object; span=(3, 5), match='45'>,
None,
<re.Match object; span=(0, 3), match='123'>]
>>> [match.group() for match in [my_match.search(line) for line in example] if match is not None] # Filter None values
['45', '123']

You can do regex (?<=\/)\d+(?=\_p\/$). See regex101 as example
Explanation:
(?<=\/) : Look behind for /
\d+ : Look for one or more digits
(?=\_p\/$) : Look ahead for _p/ at the end of string
If there is a match, then return only \d+ value.
You can either write the code to grab all the data at once or iterate through them line by line and get the data you need.
Below is the code for both:
text_line = '''/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/3_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/6_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/7_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/8_p/
/for_sale/44.97501,46.22024,-124.82303,-123.01166_xy/0-150000_price/LOT%7CLAND_type/9_zm/2_p/'''
import re
for txt in text_line.split('\n'):
    t = re.findall(r'(?<=\/)\d+(?=\_p\/$)', txt)
    print(t)
t = re.findall(r'(?<=\/)\d+(?=\_p\/)', text_line)
print(t)
The first part processes the text line by line; the second grabs everything in one shot.
The output of each is:
Line by line:
['2']
['3']
['6']
['7']
['8']
['2']
Grab all at once:
['2', '3', '6', '7', '8', '2']
For the second one, I didn't give the $ sign as we need to grab all of it.

Python regular expression retrieving numbers between two different delimiters

I have the following string
"h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
I would like to use regular expressions to extract the groups:
group1 56,7,1
group2 88,9,1
group3 58,8,1
group4 45
group5 100
group6 null
My ultimate goal is to have tuples such as (group1, group2), (group3, group4), (group5, group6). I am not sure if this all can be accomplished with regular expressions.
I have the following regular expression, which gives me partial results:
(?<=h=|d=)(.*?)(?=h=|d=)
The matches have an extra comma at the end (for example 56,7,1,), which I would like to remove, and d=, does not return a null.

You probably do not need a regex here. A list comprehension and .split() can do what you need:
Code:
def split_it(a_string):
    # Ensure a trailing comma so that the last group is handled like the others
    if not a_string.endswith(','):
        a_string += ','
    return [x.split(',')[:-1] for x in a_string.split('=') if len(x)][1:]
Test Code:
tests = (
    "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,",
    "h=56,7,1,d=88,9,1,d=,h=58,8,1,d=45,h=100",
)
for test in tests:
    print(split_it(test))
Results:
[['56', '7', '1'], ['88', '9', '1'], ['58', '8', '1'], ['45'], ['100'], ['']]
[['56', '7', '1'], ['88', '9', '1'], [''], ['58', '8', '1'], ['45'], ['100']]

You could match rather than split using the expression
[dh]=([\d,]*),
and grab the first group, see a demo on regex101.com.
That is
[dh]=      # d or h, followed by =
([\d,]*)   # capture digits and commas, 0+ times
,          # require a comma afterwards
In Python:
import re
rx = re.compile(r'[dh]=([\d,]*),')
string = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
numbers = [m.group(1) for m in rx.finditer(string)]
print(numbers)
Which yields
['56,7,1', '88,9,1', '58,8,1', '45', '100', '']
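The question's ultimate goal was tuples such as (group1, group2). One way to get there from the list above is sketched below; it assumes the input strictly alternates h= and d= entries, and it maps the empty capture to None to stand in for the "null" the question asks for. The names values and pairs are illustrative.

```python
import re

rx = re.compile(r'[dh]=([\d,]*),')
string = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"

# An empty capture becomes None (an assumption about what "null" should mean)
values = [m.group(1) or None for m in rx.finditer(string)]

# Zipping one iterator with itself consumes the values two at a time,
# which pairs each h= entry with the d= entry that follows it
it = iter(values)
pairs = list(zip(it, it))

print(pairs)
```

If the number of entries is odd, zip silently drops the last unpaired value, so the input should be checked first when that matters.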

You can use ([a-z]=)([0-9,]+)(,)?
You only need to select the group by index: the values are in group 2. Note that [0-9,]+ is greedy, so group 2 keeps the trailing comma.

You could use $ in positive lookahead to match against the end of the string:
import re

input_str = "h=56,7,1,d=88,9,1,h=58,8,1,d=45,h=100,d=,"
groups = []
for x in re.findall('(?<=h=|d=)(.*?)(?=d=|h=|$)', input_str):
    m = x.strip(',')
    if m:
        groups.append(m.split(','))
    else:
        groups.append(None)
print(groups)
Output:
[['56', '7', '1'], ['88', '9', '1'], ['58', '8', '1'], ['45'], ['100'], None]

Here, I have assumed that parameters will only have numerical values. If it is so, then you can try this.
(?<=h=|d=)([0-9,]*)
Hope it helps.

Replace the particular value of array by some other value using python for loop

staff_text=['31','32']
staffing_title = ['14','28','14','20']
I have two arrays like the ones above, and I want output like this:
staffing_title = ['31','28','32','20']
So basically, whenever 14 appears in the staffing_title array, it should be replaced by the next value from staff_text.
For example, the first 14 is replaced by 31, the second 14 by 32, and so on.

Here is the one liner using list comprehension :
>>> staffing_title = ['14', '28', '14', '20']
>>> staff_text=['31','32']
>>> res = [staff_text.pop(0) if item == str(14) else item for item in staffing_title ]
>>> print(res)
['31', '28', '32', '20']

The following will do it (pop(0) takes the replacements in order, from the front of the list):
>>> [t if t != '14' else staff_text.pop(0) for t in staffing_title]
['31', '28', '32', '20']
Note that this modifies staff_text, so you might want to make it operate on a copy.
This code assumes that there are at least as many elements in staff_text as there are '14' strings in staffing_title (but then you don't specify what should happen if there aren't).
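A variant that leaves staff_text untouched could read the replacements through an iterator instead of popping them. This is only a sketch: the name replacements is illustrative, and keeping the original '14' once the replacements run out is an assumption, since the question does not say what should happen in that case.

```python
staff_text = ['31', '32']
staffing_title = ['14', '28', '14', '20']

replacements = iter(staff_text)  # reads staff_text without modifying it

# next(replacements, item) falls back to the original item once the
# replacements are used up (an assumed behaviour, not from the question)
res = [next(replacements, item) if item == '14' else item
       for item in staffing_title]

print(res)         # ['31', '28', '32', '20']
print(staff_text)  # ['31', '32']
```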

How to remove whitespace in a list

I can't remove my whitespace in my list.
invoer = "5-9-7-1-7-8-3-2-4-8-7-9"
cijferlijst = []
for cijfer in invoer:
    cijferlijst.append(cijfer.strip('-'))
I already made a list from my string and separated everything, but each "-" is now a "". I tried the following, but it doesn't work:
filter(lambda x: x.strip(), cijferlijst)
filter(str.strip, cijferlijst)
filter(None, cijferlijst)
abc = [x.replace(' ', '') for x in cijferlijst]

Try that:
>>> ''.join(invoer.split('-'))
'597178324879'

If you want the numbers in string without -, use .replace() as:
>>> string_list = "5-9-7-1-7-8-3-2-4-8-7-9"
>>> string_list.replace('-', '')
'597178324879'
If you want the numbers as list of numbers, use .split():
>>> string_list.split('-')
['5', '9', '7', '1', '7', '8', '3', '2', '4', '8', '7', '9']

This looks a lot like the following question:
Python: Removing spaces from list objects
The answer there is to use strip instead of replace. Have you tried the following?
abc = [x.strip(' ') for x in cijferlijst]
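Stripping spaces does not drop the empty strings that the loop in the question produces, though. A likely reason the filter attempts appeared not to work is that filter returns a new iterable and leaves the original list unchanged, so its result has to be stored. A minimal sketch, where the name cleaned is an illustrative choice:

```python
invoer = "5-9-7-1-7-8-3-2-4-8-7-9"

# Same effect as the loop in the question: every "-" ends up as an empty string
cijferlijst = [cijfer.strip('-') for cijfer in invoer]

# filter(None, ...) keeps only the truthy items; the result must be assigned,
# because cijferlijst itself is not modified
cleaned = list(filter(None, cijferlijst))

print(cleaned)
```

For this input the result is the same list that invoer.split('-') gives directly.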

Process a list and output as list

I'm quite new to python and have a question about processing a list with a list as result.
Example:
list1 = ["vbhg12vbdf42vbsdh24", "dbsh13vdsj24lvk48"] #must become [['12','42','24'], ['13','24','48']]
list2 = (re.findall("\d+", str(list1))) # gives ['12', '42', '24', '13', '24', '48']
See comments. Any idea how I can do this?
Much appreciated.

First, write the pattern as a raw string (prefix it with r) so that backslash sequences such as \d reach the regex engine unchanged. Then loop over your list and apply findall() to each element instead of to str(list1). You can use a list comprehension:
>>> list1 = ["vbhg12vbdf42vbsdh24", "dbsh13vdsj24lvk48"]
>>> import re
>>> [re.findall(r'\d+',i) for i in list1]
[['12', '42', '24'], ['13', '24', '48']]

How about:
result = []
for x in list1:
    result.append(re.findall(r"\d+", x))
Or, as a list comprehension:
result = [re.findall(r"\d+", x) for x in list1]
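If the pattern is applied to many strings, it can also be compiled once and reused. The sketch below does that and additionally converts the matches to integers, which goes beyond what the question asked for and is shown only as an option. The name digits is illustrative.

```python
import re

list1 = ["vbhg12vbdf42vbsdh24", "dbsh13vdsj24lvk48"]
digits = re.compile(r"\d+")  # compiled once, reused for every element

# One inner list per input string; int() is optional and not part of the question
result = [[int(n) for n in digits.findall(x)] for x in list1]

print(result)  # [[12, 42, 24], [13, 24, 48]]
```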
