Regex re.findall() search to extract unit beginning with # and postcode - python

I am using Python 3.6 and trying to extract building units (which start with #) and postcodes from a string using re.findall(), following the explanation in "Extracting phone numbers from a free form text in python by using regex". I don't fully understand how the pattern works, and I do not get the result I am looking for.
Here is my code:
string='Road #10-13, Tree 26739 #23.04 934047 Holiday'
re.findall(r'[#][0-9(\)][0-9 ,\.\-\(\)]{8,}[0-9 ,\(\)]', string)
Basically I would like to obtain something like
['#10-13,','#23.04 934047 ']
But, presumably because there is a comma after #10-13, I only obtain:
['#23.04 934047 ']
What I want to express in my pattern is that the match has to end with a digit 0-9 OR a ','. Even if I change the string and add a ',' after #23.04, I still get the same result.
Could someone also explain to me the meaning of {8,} ?

Your problem is not the comma. {8,} means "at least 8 repetitions of the preceding item", so the character class in the middle has to match 8 or more characters, and #10-13, has only 7 characters in total, 5 of them for that part. Changing it to {5,} makes it work:
>>> re.findall(r'[#][0-9(\)][0-9 ,\.\-\(\)]{5,}[0-9 ,\(\)]', string)
['#10-13, ', '#23.04 934047 ']
I would use a simpler approach though, not sure if it matches all your requirements but it certainly works here:
>>> re.findall(r'#[-,.\d ()]+', string)
['#10-13, ', '#23.04 934047 ']
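To make the {8,} point concrete, here is a small sketch that runs the question's pattern with different minimum repetition counts. The helper name find_units is made up for this illustration; {n,} itself is standard re syntax.

```python
import re

string = 'Road #10-13, Tree 26739 #23.04 934047 Holiday'


def find_units(minimum):
    """Run the question's pattern with {minimum,} as the middle quantifier.

    find_units is a hypothetical helper, not part of the original post.
    """
    pattern = r'[#][0-9(\)][0-9 ,\.\-\(\)]{%d,}[0-9 ,\(\)]' % minimum
    return re.findall(pattern, string)


# '#10-13, ' leaves only 5 characters for the middle class, so {8,} skips it.
print(find_units(8))  # ['#23.04 934047 ']
print(find_units(5))  # ['#10-13, ', '#23.04 934047 ']
```

Lowering the minimum further would also start matching shorter fragments, so the right value depends on how short a valid unit can be.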

You can use a look-ahead, i.e. extract the part of the string that starts with a # and is followed by anything, up to the point where one or more non-word characters (e.g. a space or a comma) are immediately followed by an uppercase letter:
re.findall("#.+?(?=\\W+[A-Z])",string)
['#10-13', '#23.04 934047']
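The look-ahead only checks what comes next and does not include it in the match, which is why the trailing comma and space are dropped. A minimal sketch of the difference, with the same pattern written as a raw string (the variable names are only illustrative):

```python
import re

string = 'Road #10-13, Tree 26739 #23.04 934047 Holiday'

# Without the look-ahead, the lazy .+? stops after a single character.
lazy_only = re.findall(r'#.+?', string)

# With (?=\W+[A-Z]) the match grows until non-word characters followed by
# an uppercase letter come next; those characters are checked, not consumed.
with_lookahead = re.findall(r'#.+?(?=\W+[A-Z])', string)

print(lazy_only)       # ['#1', '#2']
print(with_lookahead)  # ['#10-13', '#23.04 934047']
```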

I feel the regex could be a lot simpler
string='Road #10-13, Tree 26739 #23.04 934047 Holiday'
re.findall(r'#[\d\- \.]+', string)
outputs (the comma is not in the character class, so it is not captured):
['#10-13', '#23.04 934047 ']

Related

How to create a regular expression that would find all pieces of text BETWEEN certain sets of characters?

I have a string that looks like 'E10 1/05/03 2/3211 3/AO Yuzhmor'.
The pieces that I need to extract are the ones following ' \d\/':
1) 05/03
2) 3211
3) AO Yuzhmor
My last idea was ' \d\/(.*?)(?=(( \d\/)|\Z))'
but it still wouldn't work properly on the last piece (the |\Z instruction doesn't seem to do anything).
I think you're close. This works for your example:
>>> s = 'E10 1/05/03 2/3211 3/AO Yuzhmor'
>>> re.findall(r'\s\d/(.*?)(?=\s\d/|$)', s)
['05/03', '3211', 'AO Yuzhmor']
Explanation:
Match on [space][digit]/, capturing everything that follows using a non-greedy quantifier, until the current position is immediately before either another [space][digit]/ (detected using a lookahead, matched but not consumed) or the end of the input. Use findall to return all matching instances in the input.
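One plausible reason the pattern in the question seemed to misbehave is that re.findall returns a tuple per match as soon as the pattern has more than one capturing group, and groups inside the look-ahead count too. A small sketch comparing the two forms (the variable names are only illustrative):

```python
import re

s = 'E10 1/05/03 2/3211 3/AO Yuzhmor'

# The question's pattern: three capturing groups, so findall yields 3-tuples.
with_groups = re.findall(r' \d\/(.*?)(?=(( \d\/)|\Z))', s)

# Same logic with no capturing groups inside the look-ahead.
single_group = re.findall(r' \d/(.*?)(?= \d/|\Z)', s)

print(with_groups)   # a list of 3-tuples, wanted text is the first item
print(single_group)  # ['05/03', '3211', 'AO Yuzhmor']
```

If the grouping parentheses are still needed, writing them as (?:...) keeps them out of the findall result.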
This can be tricky because we don't know all of the rules of how these strings are built. One option is to use your regex to split the string
>>> re.split(r" \d/", 'E10 1/05/03 2/3211 3/AO Yuzhmor')[1:]
['05/03', '3211', 'AO Yuzhmor']
Another is to be more specific about the fields, assuming that they are always " 1/", " 2/" and " 3/"
>>> re.match(r".*?1/(.*?) 2/(.*?) 3/(.*)", 'E10 1/05/03 2/3211 3/AO Yuzhmor').groups()
('05/03', '3211', 'AO Yuzhmor')
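Try
re.findall(r'\d/(\S+)', s)
Note that \S+ stops at the first whitespace, so for the example this returns 'AO' rather than 'AO Yuzhmor' for the last piece.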

Python Regex Simple Split - Empty at first index

I have a String that looks like
test = '20170125NBCNightlyNews'
I am trying to split it into two parts: the digits and the name. The format will always be [date][show]. The date is stripped of formatting and is digits only, in the order YYYYMMDD (I don't think that matters).
I am trying to use re. I have a working version:
re.split('(\d+)',test)
Simple enough, this gives me the values I need in a list.
['', '20170125', 'NBCNightlyNews']
However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.
I also tried telling it to match the beginning of the string as well, and got the same results.
>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>
Does anyone have any input as to why this is there / how I can avoid the empty string?
Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse a re.split to parse this data:
test[:8], test[8:]
Will split your strings just fine.
What you are actually doing with re.split('(^\d+)', test) is splitting your test string on a run of one or more digits.
So, if you have
test = '20170125NBCNightlyNews'
This is happening:
20170125 NBCNightlyNews
^^^^^^^^
The string is split into three parts, everything before the number, the number itself and everything after the number.
Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.
re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']
re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
You're getting an empty result at the beginning because your input string starts with digits and you're splitting it on digits only. Hence you get the empty string that comes before the first run of digits.
To avoid that you can use filter:
>>> list(filter(None, re.split('(\d+)', test)))
['20170125', 'NBCNightlyNews']
Why re.split when you can just match and get the groups?...
import re
test = '20170125NBCNightlyNews'
pattern = re.compile('(\d+)(\w+)')
result = re.match(pattern, test)
result.groups()[0] # for the date part
result.groups()[1] # for the show name
I realize now the intention was to parse the text, not fix the regex usage. I'm with the others, you shouldn't use regex for this simple task when you already know the format won't change and the date is fixed size and will always be first. Just use string indexing.
From the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.
So if you have:
test = 'test20170125NBCNightlyNews'
The indexes would remain unaffected:
>>> re.split('(\d+)',test)
['test', '20170125', 'NBCNightlyNews']
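The "same relative indices" point from the documentation can be seen by slicing the result: with one capturing group in the separator, the odd positions are always the separators. A short sketch, where split_parts is a made-up helper name:

```python
import re


def split_parts(text):
    """Split on runs of digits and separate the separators from the rest.

    split_parts is a hypothetical helper used only for this illustration.
    """
    parts = re.split(r'(\d+)', text)
    separators = parts[1::2]  # what the capturing group matched
    around = parts[0::2]      # text before and after each separator
    return separators, around


print(split_parts('20170125NBCNightlyNews'))
# (['20170125'], ['', 'NBCNightlyNews'])
print(split_parts('test20170125NBCNightlyNews'))
# (['20170125'], ['test', 'NBCNightlyNews'])
```

The empty string in the first result is simply the (empty) text before the first separator.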
If the date is always 8 digits long, I would access the substrings directly (without using regex):
>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']
If the length of the date might vary, I would use:
>>> s = re.search('^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']

How do I strip patterns or words from the end of the string backwards?

I have a string like this:
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>
I would like to strip the first 3 opening and the last 3 closing tags from the string. I do not know the tag names in advance.
I can strip the first 3 tags with re.sub(r'<[^<>]+>', '', in_str, 3). How do I strip the closing tags? What should remain is:
<v1>aaa<b>bbb</b>ccc</v1>
I know I could maybe 'do it right', but I actually do not wish to do xml nor html parsing for my purpose, which is to aid myself visualizing the xml representation of some classes.
Instead, I realized that this problem is interesting. It seems I cannot simply search backwards with regex, i.e. right to left, because that seems to be unsupported:
If you mean, find the right-most match of several (similar to the
rfind method of a string) then no, it is not directly supported. You
could use re.findall() and chose the last match but if the matches can
overlap this may not give the correct result.
But .rstrip is not good with words, and won't do patterns either.
I looked at Strip HTML from strings in Python but I only wish to strip up to 3 tags.
What approach could be used here? Should I reverse the string (ugly in itself and due to the '<>'s)? Do tokenization (why not parse, then?)? Or create static closing tags based on the left-to-right match?
Which strategy to follow to strip the patterns from the end of the string?
The simplest would be to use old-fashioned string splitting and limit the split:
in_str.split('>', 3)[-1].rsplit('<', 3)[0]
Demo:
>>> in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
>>> in_str.split('>', 3)[-1].rsplit('<', 3)[0]
'<v1>aaa<b>bbb</b>ccc</v1>'
str.split() and str.rsplit() with a limit will split the string from the start or the end up to the limit times, letting you select the remainder unsplit.
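Wrapped up as a function, the same idea might look like the sketch below. The name strip_outer_tags and its count parameter are assumptions made for illustration; like the one-liner, it does not check that the tags actually pair up.

```python
def strip_outer_tags(in_str, count=3):
    """Drop `count` leading and `count` trailing tags using plain splits.

    Hypothetical helper; it assumes the string starts and ends with at
    least `count` tags and does not verify that they match each other.
    """
    # Keep everything after the count-th '>' from the left ...
    remainder = in_str.split('>', count)[-1]
    # ... then everything before the count-th '<' from the right.
    return remainder.rsplit('<', count)[0]


in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
print(strip_outer_tags(in_str))     # <v1>aaa<b>bbb</b>ccc</v1>
print(strip_outer_tags(in_str, 1))  # <bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar>
```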
You've already got practically all the solution. re can't do backwards, but you can:
import re
in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
in_str = re.sub(r'<[^<>]+>', '', in_str, 3)
in_str = in_str[::-1]
print(in_str)  # shows the reversed intermediate string
in_str = re.sub(r'>[^<>]+/<', '', in_str, 3)
in_str = in_str[::-1]
print(in_str)
<v1>aaa<b>bbb</b>ccc</v1>
Note the reversed regex for the reversed string, but then it goes back-to-front.
Of course, as mentioned, this is way easier with a proper parser:
in_str = '<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>'
from lxml import etree
ix = etree.fromstring(in_str)
print(etree.tostring(ix[0][0][0]).decode())
<v1>aaa<b>bbb</b>ccc</v1>
I would look into regular expressions and split on a suitable pattern:
http://docs.python.org/3/library/re.html?highlight=regex#re.regex.split
Sorry, can't comment, but will give it as an answer.
in_str.split('>', 3)[-1].rsplit('<', 3)[0] will work for the given example
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo>, but not for
<foo><bar><k2><v1>aaa<b>bbb</b>ccc</v1></k2></bar></foo><another>test</another>.
You just should be aware of this.
To solve the counter example provided by me, you will have to track state (or count) of tags and evaluate that you match the correct pairs.

Checking and removing extra symbols

I'm interested in removing extra symbols from strings in Python.
What would be the most efficient and Pythonic way to do that? Is there some grammar module?
My first idea would be to locate the most deeply nested text and walk outwards to the left and to the right, counting the opening and closing symbols. Then I would remove the last symbol of whichever kind has too many.
An example would be this string
text = "(This (is an example)"
You can clearly see that the first parenthesis is not balanced by another one, so I want to delete it.
text = "This (is an example)"
The solution has to be independent of the position of the parentheses.
Another example could be:
text = "(This (is another example) )) (to) explain) the question"
That would become:
text = "(This (is another example) ) (to) explain the question"
Had to break this into an answer for formatting. Check Python's regular expression module.
If I'm understanding what you are asking, look at re.sub. You can use a regular expression to find the characters you'd like to remove, and replace them with an empty string.
Suppose we want to remove all instances of '.', '&', and '*'.
>>> import re
>>> s = "abc&def.ghi**jkl&"
>>> re.sub('[\.\&\*]', '', s)
'abcdefghijkl'
If the pattern to be matched is larger, you can use re.compile and pass that as the first argument to sub.
>>> r = re.compile('[\.\&\*]')
>>> re.sub(r, '', s)
'abcdefghijkl'
Hope this helps.
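For the balancing problem described in the question itself, one possible sketch is a single pass with a stack of open-parenthesis positions rather than a regular expression. The function name remove_unbalanced is made up here, and the choice to drop the outermost unmatched "(" is an assumption that happens to agree with both examples in the question.

```python
def remove_unbalanced(text, opening='(', closing=')'):
    """Return text without the parentheses that have no partner.

    Hypothetical helper: a stack-based sketch, not a grammar module.
    """
    unmatched_open = []  # indices of '(' still waiting for a ')'
    to_drop = set()      # indices of characters to remove
    for index, char in enumerate(text):
        if char == opening:
            unmatched_open.append(index)
        elif char == closing:
            if unmatched_open:
                unmatched_open.pop()  # this ')' closes the latest '('
            else:
                to_drop.add(index)    # ')' with nothing to close
    to_drop.update(unmatched_open)    # '(' that were never closed
    return ''.join(c for i, c in enumerate(text) if i not in to_drop)


print(remove_unbalanced("(This (is an example)"))
# This (is an example)
print(remove_unbalanced("(This (is another example) )) (to) explain) the question"))
# (This (is another example) ) (to) explain the question
```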

Matching two almost similar string (python)

In a file I can have either of the following two string formats:
::WORD1::WORD2= ANYTHING
::WORD3::WORD4::WORD5= ANYTHING2
This is the regex I came up with:
::(\w+)(?:::(\w+))?::(\w+)=(.*)
regex.findall(..)
[(u'WORD1', u'', u'WORD2', u' ANYTHING'),
(u'WORD3', u'WORD4', u'WORD5', u' ANYTHING2')]
My first question is, why do I get this empty u'' when matching the first string ?
My second question is: is there an easier way to write this regex? The two strings are very similar, except that sometimes I have this extra ::WORD5.
My last question is: most of the time I have only word characters between the ::, so \w+ is enough, but sometimes I can get things like 2-WORD2 or 3-2-WORD2, where a - appears. How can I add it to the \w+?
For the last question:
[\w\-]+
Explanation: \w matches any word character, and \- adds a literal hyphen to the character class.
Captured groups are always included in re.findall results, even if they don't match anything. That's why you get an empty string. If you just want to get what's between the delimiters, try split instead of findall:
a = '::WORD1::WORD2= ANYTHING'
b = '::WORD3::WORD4::WORD5= ANYTHING2'
print(re.split(r'::|= ', a)[1:]) # ['WORD1', 'WORD2', 'ANYTHING']
print(re.split(r'::|= ', b)[1:]) # ['WORD3', 'WORD4', 'WORD5', 'ANYTHING2']
In response to the comments, if "ANYTHING" could be well, anything, it's easier to use string functions rather than regexps:
x, y = a.split('= ', 1)
results = x.split('::')[1:] + [y]
Based on thg435's answer, you can just split on the "=" and then do exactly the same thing, something like:
left,right = a.split('=', 1)
answer = left.split('::')[1:] + [right]
For your last question you can do something like this (it accepts letters, digits and "-"):
[a-zA-Z0-9\-]+
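Putting the pieces of this thread together, a rough sketch of the original findall approach with hyphens allowed might look like this. The second sample line is made up to show a hyphenated word, and filtering out empty strings is just one way to hide the unmatched optional group.

```python
import re

lines = [
    '::WORD1::WORD2= ANYTHING',
    '::WORD3::3-2-WORD4::WORD5= ANYTHING2',  # made-up line with hyphens
]

# [\w-]+ accepts word characters and hyphens; the middle group is optional.
pattern = re.compile(r'::([\w-]+)(?:::([\w-]+))?::([\w-]+)=(.*)')

results = []
for line in lines:
    for groups in pattern.findall(line):
        # Drop the '' that findall reports for the unmatched optional group.
        results.append([g for g in groups if g])

print(results)
# [['WORD1', 'WORD2', ' ANYTHING'], ['WORD3', '3-2-WORD4', 'WORD5', ' ANYTHING2']]
```

Note that the filter would also drop an empty value after the "=", so the split-based answers above may be the safer choice when that can happen.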
