I am creating a function which takes in a street plus house number and possibly an addition and returns the house number. I could simply extract the integers, but the problem is that there can be an integer in the street name itself. For instance:
my_string = 'Hendrik 4e laan 18 bis'
In this case, I would like to return:
street_name = 'Hendrik 4e laan', street_number = 18, street_number_addition = 'bis'
I cannot simply split the string on spaces and take the last integer ([x for x in my_string.split() if x.isdigit()][-1]), because the street number addition might be attached to the street number (e.g. 18bis or 18b).
Hence how do I get a list of [4, 18] such that I can simply take the last item?
This works for me:
>>> import re
>>> my_string = 'Hendrik 4e laan 18 bis'
>>> re.findall( r'(\d+)', my_string )
['4', '18']
>>>
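Building on the findall call above, here is a minimal sketch of taking the last match and converting it to an integer. The helper name last_number is made up for illustration, and returning None for a string without digits is an assumption about the desired behaviour.

```python
import re

def last_number(text):
    """Return the last run of digits in text as an int, or None if there is none."""
    numbers = re.findall(r'\d+', text)
    return int(numbers[-1]) if numbers else None

print(last_number('Hendrik 4e laan 18 bis'))  # 18
print(last_number('Hendrik 4e laan 18bis'))   # 18
```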
If you know that you only want the last occurrence, then another option, instead of finding all of them (which you could do using findall - see lenik's answer) is to search for a sequence of digits after which there are no more digits before the end of the string:
import re
my_string = 'Hendrik 4e laan 18 bis'
match = re.search(r'(\d+)\D*$', my_string)
if match:
    number = int(match.group(1))  # gives here: number = 18
else:
    print("no digits found")
Explanation of regular expression:
(\d+) - one or more digits (with grouping parentheses, so that we can use .group(1) to extract them)
\D* - zero or more characters that are not digits
$ - the end of the string
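The same idea can probably be stretched to return all three parts the question asks for. The sketch below is an assumption-laden illustration, not a general address parser: the function name split_address is hypothetical, the second group (\D*) is added to capture the addition, and everything before the last run of digits is treated as the street name.

```python
import re

def split_address(address):
    """Split 'street 18 bis' style input into (street_name, number, addition).

    Returns None when the address contains no digits at all.
    """
    # Last run of digits, followed only by non-digits up to the end of the string.
    match = re.search(r'(\d+)(\D*)$', address)
    if match is None:
        return None
    street_name = address[:match.start()].strip()
    number = int(match.group(1))
    addition = match.group(2).strip()  # empty string when there is no addition
    return street_name, number, addition

print(split_address('Hendrik 4e laan 18 bis'))  # ('Hendrik 4e laan', 18, 'bis')
print(split_address('Hendrik 4e laan 18b'))     # ('Hendrik 4e laan', 18, 'b')
```

Note that an addition which itself contains a digit (for example "18 2hoog") would be split at the wrong place, so this only fits the address shapes shown in the question.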
I would like to construct a regular expression pattern for the following string, and use Python to extract:
str = "hello w0rld how 34 ar3 44 you\n welcome 200 stack000verflow\n"
What I want to do is extract the independent number values and add them, which should give 278. A preliminary Python code is:
import re
x = re.findall('([0-9]+)', str)
The problem with the above code is that numbers within a char substring like 'ar3' would show up. Any idea how to solve this?
Why not try something simpler like this?:
str = "hello w0rld how 34 ar3 44 you\n welcome 200 stack000verflow\n"
print(sum(int(s) for s in str.split() if s.isdigit()))
# 278
s = re.findall(r"\s\d+\s", str)  # \s matches whitespace before and after the number
print(sum(map(int, s)))  # print the sum of all matches
\d+ matches one or more digits. This gives the expected output:
278
How about this?
x = re.findall(r'\s([0-9]+)\s', str)
The solutions posted so far only work (if at all) for numbers that are preceded and followed by whitespace. They will fail if a number occurs at the very start or end of the string, or if a number appears at the end of a sentence, for example. This can be avoided using word boundary anchors:
a = "100 bottles of beer on the wall (ignore the 1000s!), now 99, now only 98"
s = re.findall(r"\b\d+\b", a)  # \b matches at the start/end of an alphanumeric sequence
print(sum(map(int, s)))
Result: 297
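To make the difference concrete, here is a small sketch that runs the whitespace-splitting approach and the word-boundary approach on both sample strings from this thread. The two function names are invented for the comparison.

```python
import re

def sum_whitespace_delimited(text):
    # Counts only tokens that consist entirely of digits after splitting on whitespace.
    return sum(int(tok) for tok in text.split() if tok.isdigit())

def sum_word_bounded(text):
    # Counts digit runs that are not glued to letters, digits or underscores.
    return sum(int(n) for n in re.findall(r'\b\d+\b', text))

first = "hello w0rld how 34 ar3 44 you\n welcome 200 stack000verflow\n"
second = "100 bottles of beer on the wall (ignore the 1000s!), now 99, now only 98"

print(sum_whitespace_delimited(first), sum_word_bounded(first))    # 278 278
print(sum_whitespace_delimited(second), sum_word_bounded(second))  # 198 297
```

The split version misses "99," because of the trailing comma. The word-boundary version has its own limit: a decimal such as 3.14 would be counted as 3 and 14.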
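To avoid a partial match, anchor the pattern at both ends (this matches only a string that consists entirely of digits, and it also matches the empty string):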
'^[0-9]*$'
I need to do a string compare to see if 2 strings are equal, like:
>>> x = 'a1h3c'
>>> x == 'a__c'
>>> True
independent of the 3 characters in middle of the string.
You need to use anchors.
>>> import re
>>> x = 'a1h3c'
>>> pattern = re.compile(r'^a.*c$')
>>> pattern.match(x) is not None
True
This checks that the first and last characters are a and c, and it does not care about the characters in the middle.
If you want to check for exactly three characters in the middle, you could use this:
>>> pattern = re.compile(r'^a...c$')
>>> pattern.match(x) is not None
True
Note that the end-of-line anchor $ is important: without $, a...c would also match afoocbarbuz.
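As an alternative to writing the anchors by hand, re.fullmatch (available since Python 3.4) only succeeds when the whole string fits the pattern. A minimal sketch, with matches_template as a made-up helper name:

```python
import re

def matches_template(text):
    # fullmatch succeeds only if the whole string fits the pattern,
    # so no explicit ^ or $ is needed.
    return re.fullmatch(r'a.{3}c', text) is not None

print(matches_template('a1h3c'))        # True
print(matches_template('afoocbarbuz'))  # False
print(matches_template('a12c'))         # False
```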
Your problem could be solved with string indexing, but if you want an intro to regex, here ya go.
import re
your_match_object = re.match(pattern,string)
The pattern in your case would be
pattern = re.compile("a...c$")  # each dot matches any character except a newline; $ rejects trailing characters
From here, you can see if your string fits this pattern with
print(pattern.match("a1h3c") is not None)
https://docs.python.org/2/howto/regex.html
https://docs.python.org/2/library/re.html#search-vs-match
if str1[0] == str2[0]:
    pass  # do something
You can repeat this statement as many times as you like.
This is indexing. We're getting the first character. To get the last character, use [-1].
I'll also mention that with indexing, the string can be of any size, as long as you know the relative position from the beginning or the end of the string.
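Applied to the a__c example from the question, the indexing approach might look like the sketch below. The function name same_ends and the fixed length of 5 are assumptions taken from the example string, not part of the answer above.

```python
def same_ends(text):
    # Length is checked first, so the index lookups below cannot fail
    # on an empty string.
    return len(text) == 5 and text[0] == 'a' and text[-1] == 'c'

print(same_ends('a1h3c'))   # True
print(same_ends('a1h3cx'))  # False
```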
How can I parse a string ['FED590498'] in Python, so that I can get the numeric value 590498 and the characters FED separately?
Some Samples:
['ICIC889150']
['FED889150']
['MFL541606']
The [ ] is not part of the string.
If the number of letters is variable, it's easiest to use a regular expression:
import re
characters, numbers = re.search(r'([A-Z]+)(\d+)', inputstring).groups()
This assumes that:
The letters are uppercase ASCII
There is at least 1 character, and 1 digit in each input string.
You can lock the pattern down further by using {3,4} (without a space) instead of + to limit repetition to just 3 or 4 instead of at least 1, etc.
Demo:
>>> import re
>>> inputstring = 'FED590498'
>>> characters, numbers = re.search(r'([A-Z]+)(\d+)', inputstring).groups()
>>> characters
'FED'
>>> numbers
'590498'
Given the requirement that there are always 3 or 4 letters you can use:
import re
characters, numbers = re.findall(r'([A-Z]{3,4})(\d+)', 'FED590498')[0]
characters, numbers
#('FED', '590498')
Or even:
ids = ['ICIC889150', 'FED889150', 'MFL541606']
[re.search(r'([A-Z]{3,4})(\d+)', id).groups() for id in ids]
#[('ICIC', '889150'), ('FED', '889150'), ('MFL', '541606')]
As suggested by Martijn, search is the preferred way.
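Both variants above raise an exception when an input does not fit the pattern (.groups() on None, or [0] on an empty list). A hedged sketch of a version that returns None instead; the names CODE_PATTERN and split_code are invented, and using fullmatch assumes the whole string must be letters followed by digits:

```python
import re

# 3 or 4 uppercase ASCII letters followed by one or more digits.
CODE_PATTERN = re.compile(r'([A-Z]{3,4})(\d+)')

def split_code(code):
    """Return (letters, digits) as strings, or None if code does not fit."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        return None
    return match.groups()

for code in ['ICIC889150', 'FED889150', 'MFL541606', 'bad-input']:
    print(code, split_code(code))
```

The digits are left as a string here so that leading zeros survive; wrap them in int() if a number is needed.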
I am looking for a regex in Python to match everything before 19 and after 24.
File names are test_case_*.py, where the asterisk is a 1 or 2 digit number.
E.g.: test_case_1.py, test_case_27.py.
Initially, I thought something like [1-19] should work, but it turned out to be much harder than I thought.
Has anyone worked on a solution for such cases?
PS: I am OK even if we can find one regex for all numbers before a number x and one for all numbers after a number y.
I wouldn't use a regex for validating the number itself, I would use one only for extracting the number, e.g.:
>>> import re
>>> name = 'test_case_42.py'
>>> num = int(re.match(r'test_case_(\d+)\.py', name).group(1))
>>> num
42
and then use something like:
num < 19 or num > 24
to ensure num is valid. The reason for this is that it's much harder to adapt a regex that does this than it is to adapt something like num < 19 or num > 24.
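Put together, the extract-then-compare approach could be sketched as follows. The helper names case_number and is_wanted are hypothetical, and \d{1,2} encodes the question's statement that the number has one or two digits.

```python
import re

def case_number(filename):
    """Return the number in test_case_<n>.py as an int, or None if the name does not fit."""
    match = re.fullmatch(r'test_case_(\d{1,2})\.py', filename)
    return int(match.group(1)) if match else None

def is_wanted(filename):
    num = case_number(filename)
    # The range check lives in plain Python, so changing the bounds is trivial.
    return num is not None and (num < 19 or num > 24)

names = ['test_case_1.py', 'test_case_19.py', 'test_case_24.py',
         'test_case_27.py', 'notes.txt']
print([n for n in names if is_wanted(n)])  # ['test_case_1.py', 'test_case_27.py']
```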
The following should do it (for matching the entire filename):
^test_case_([3-9]?\d|1[0-8]|2[5-9])\.py$
Explanation:
^ # beginning of string anchor
test_case_ # match literal characters 'test_case_' (file prefix)
( # begin group
[3-9]?\d # match 0-9 or 30-99
| # OR
1[0-8] # match 10-18
| # OR
2[5-9] # match 25-29
) # end group
\.py # match literal characters '.py' (file suffix)
$ # end of string anchor
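A quick way to gain confidence in a range-matching regex like this one is to check it against every one- and two-digit number. A small sketch of such a check:

```python
import re

pattern = re.compile(r'^test_case_([3-9]?\d|1[0-8]|2[5-9])\.py$')

# Numbers 0..99 whose file name the regex accepts.
matched = [n for n in range(100) if pattern.match('test_case_%d.py' % n)]
# Numbers 0..99 that should be accepted according to the question.
expected = [n for n in range(100) if n < 19 or n > 24]

print(matched == expected)  # True
```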
Something like
"(?<=_)(?!(19|20|21|22|23|24)\.)[0-9]+(?=\.)"
One or more digits `[0-9]+`
that aren't 19-24 followed by a . `(?!(19|20|21|22|23|24)\.)`
following a _ `(?<=_)` and preceding a . `(?=\.)`
http://regexr.com?35rbm
Or more compactly
"(?<=_)(?!(19|2[0-4])\.)[0-9]+(?=\.)"
where the 20-24 range has been compacted.
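Because the lookarounds do not consume the underscore or the dot, the match itself is just the number. A short sketch of using the compact pattern with re.search on a few sample file names:

```python
import re

pattern = re.compile(r'(?<=_)(?!(19|2[0-4])\.)[0-9]+(?=\.)')

for name in ['test_case_1.py', 'test_case_19.py', 'test_case_24.py', 'test_case_27.py']:
    match = pattern.search(name)
    # group(0) is the number alone; None means the file is in the excluded range.
    print(name, match.group(0) if match else None)
```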
I have a file with lines in following format:
d 55 r:100:10000
I would like to find that 55 and parse it to an int. How can I do this? It should be robust to a variable amount of whitespace: there may be more or fewer spaces in between, but the number will always be between d and r.
That's easy:
number = int(line.split()[1])
If you actually need to check whether it's between d and r, then use
import re
number = int(re.search(r"d\s+(\d+)\s+r", line).group(1))
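If some lines in the file might not follow the format, calling .group(1) directly on the result would raise an AttributeError. A minimal sketch that returns None for such lines instead; the function name parse_number is invented for illustration:

```python
import re

def parse_number(line):
    """Return the integer between 'd' and 'r', or None if the line does not fit."""
    match = re.search(r'd\s+(\d+)\s+r', line)
    return int(match.group(1)) if match else None

print(parse_number('d 55 r:100:10000'))        # 55
print(parse_number('d     55   r:100:10000'))  # 55
print(parse_number('x 55 y'))                  # None
```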
The str.split() method by default splits on arbitrary-width whitespace:
>>> 'd 55 r:100:10000'.split()
['d', '55', 'r:100:10000']
Picking out just the middle number then becomes a simple select:
>>> int('d 55 r:100:10000'.split()[1])
55
Quoting the documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
>>> line = 'd 55 r:100:10000'
>>> int(line.split()[1]) # By default splits by any whitespace
55
Or, to be more efficient, you can limit it to two splits (the second element of the result then holds the integer):
>>> int(line.split(None, 2)[1]) # None means split by any whitespace
55