How can I match the whole regex not the subexpression - python

Say I have the following regex to search a series of room numbers:
import re
re.findall(r'\b(\d)\d\1\b','101 102 103 201 202 203')
I want to find the room numbers whose first and last digits are the same (101 and 202). The above code gives
['1', '2']
which corresponds to the subexpression (\d). How can I make it return the whole room number, like 101 and 202?

import re
print([i for i, j in re.findall(r'\b((\d)\d\2)\b', '101 102 103 201 202 203')])
or
print([i[0] for i in re.findall(r'\b((\d)\d\2)\b', '101 102 103 201 202 203')])
You can use a list comprehension here. When the pattern contains capturing groups, re.findall returns the groups rather than the whole match, so you need two groups: the outer one captures the whole room number and the inner one is used for the backreference. Each result is a tuple of the two groups, and you keep only the first element.
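An alternative that leaves the question's original pattern unchanged, shown as a minimal sketch: re.finditer yields match objects, and group(0) is always the whole match no matter how many capturing groups the pattern has. The variable names are only illustrative.

```python
import re

rooms = '101 102 103 201 202 203'

# group(0) is the full match, so the single (\d) group can stay as it is.
matching_rooms = [m.group(0) for m in re.finditer(r'\b(\d)\d\1\b', rooms)]
print(matching_rooms)
```

This avoids adding an outer group and renumbering the backreference.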

Related

Regex to unify a format of phone numbers in Python

I'm trying to write a regex that matches a phone number of the form +34 (prefix), a single space, then 9 digits that may or may not be separated by spaces.
+34 886 24 68 98
+34 980 202 157
I would need a regex to work with these two example cases.
I tried ^(\+34)\s([ *]|[0-9]{9}) but that's not it.
Ultimately I'd like to normalize every phone number to +34 (prefix), a single space, then 9 digits, whichever of these cases is given. I'm using the re.sub() function for that, but I'm not sure how.
+34 886 24 68 98 -> ?
+34 980 202 157 -> ?
+34 846082423 -> `^(\+34)\s(\d{9})$`
+34920459596 -> `^(\+34)(\d{9})$`
import re
from faker import Faker
from faker.providers import BaseProvider
#fake = Faker("es_ES")
class CustomProvider(BaseProvider):
    def phone(self):
        #phone = fake.phone_number()
        phone = "+34812345678"
        return re.sub(r'^(\+34)(\d{9})$', r'\1 \2', phone)
You can try:
^\+34\s*(?:\d\s*){9}$
^ - beginning of the string
\+34\s* - match +34 followed by any number of spaces
(?:\d\s*){9} - match a digit followed by any amount of whitespace, 9 times
$ - end of string
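To go from matching to the unified format the question asks for, the same pattern can be combined with a capture group and a whitespace cleanup. This is a minimal sketch: the function name normalize_phone and the choice to return None for non-matching input are assumptions, not part of the answer above.

```python
import re

# Same pattern as above, with the nine digits (and any spaces) captured.
PHONE_RE = re.compile(r'^\+34\s*((?:\d\s*){9})$')

def normalize_phone(phone):
    """Return '+34 ' followed by 9 digits, or None if phone does not match."""
    m = PHONE_RE.match(phone)
    if m is None:
        return None
    digits = re.sub(r'\s+', '', m.group(1))  # drop the inner whitespace
    return '+34 ' + digits

for raw in ['+34 886 24 68 98', '+34 980 202 157', '+34 846082423', '+34920459596']:
    print(raw, '->', normalize_phone(raw))
```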
Here's a simple approach: use a regex to get the plus sign and all the digits into a list (one character per element), then use other list and string operations to format it the way you like.
import re
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
pattern = r'[+\d]'
m1 = re.findall(pattern, p1)
m2 = re.findall(pattern, p2)
m1_str = f"{''.join(m1[:3])} {''.join(m1[3:])}"
m2_str = f"{''.join(m2[:3])} {''.join(m2[3:])}"
print(m1_str) # +34 886246898
print(m2_str) # +34 980202157
Or removing spaces using string replacement instead of regex:
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
p1_compact = p1.replace(' ', '')
p2_compact = p2.replace(' ', '')
p1_str = f"{p1_compact[:3]} {p1_compact[3:]}"
p2_str = f"{p2_compact[:3]} {p2_compact[3:]}"
print(p1_str) # +34 886246898
print(p2_str) # +34 980202157
I would capture the numbers like this: r"(\+34(?:\s?\d){9})".
That allows an optional whitespace character before each digit. The non-capturing group (?:...) lets \s?\d repeat without each digit being captured as a group of its own.
import re
nums = """
Number 1: +34 886 24 68 98
Number 2: +34 980 202 157
Number 3: +34812345678
"""
number_re = re.compile(r"(\+34(?:\s?\d){9})")
for match in number_re.findall(nums):
    print(match)
+34 886 24 68 98
+34 980 202 157
+34812345678

Regular expression to find a sequence of numbers before multiple patterns, into a new column (Python, Pandas)

Here is my sample data:
import pandas as pd
import re
cars = pd.DataFrame({'Engine Information': {0: 'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
1: 'Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs',
2: 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
3: 'MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs',
4: 'Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV',
5: 'GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs'},
'HP': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None}})
Here is my desired output:
I have created a new column called 'HP' where I want to extract the horsepower figure from the original column ('Engine Information')
Here is the code I have tried to do this:
cars['HP'] = cars['Engine Information'].apply(lambda x: re.match(r'\\d+(?=\\shp|hp)', str(x)))
The idea is to match the pattern "a sequence of digits that comes before either 'hp' or ' hp'". This is because some of the cells have no space between the number and 'hp', as shown in my example.
I'm sure the regex is correct, because I have successfully done a similar process in R. However, I have tried functions such as str.extract, re.findall, re.search and re.match, and they either raise errors or return 'None' values (as shown in the sample). So I am a bit lost.
Thanks!
You can use str.extract:
cars['HP'] = cars['Engine Information'].str.extract(r'(\d+)\s*hp\b', flags=re.I)
Details
(\d+)\s*hp\b - matches and captures into Group 1 one or more digits, then just matches 0 or more whitespaces (\s*) and hp (in a case insensitive way due to flags=re.I) as a whole word (since \b marks a word boundary)
str.extract only returns the captured value if there is a capturing group in the pattern, so the hp and whitespaces are not part of the result.
Python demo results:
>>> cars
   Engine Information                              HP
0  Honda 2.4L 4 cylinder 190 hp 162 ft-lbs         190
1  Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs  420
2  Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs          390
3  MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs          118
4  Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV       360
5  GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs           352
There are several problems:
re.match only looks at the beginning of the string; use re.search if the pattern may appear anywhere.
Don't double the backslashes in a raw string: write either '\\d hp' or r'\d hp'. Raw strings exist precisely so that you can avoid the extra escaping.
Return the matched group. You only search but never extract the group that was found. re.search(rex, string) gives you a match object, from which you can extract the groups, e.g. re.search(rex, string)[0].
Wrap the access in a separate function, because you have to check whether there was a match before accessing the group. If you don't, an exception may stop the apply process right in the middle.
apply is slow; use pandas vectorized functions like extract: cars['Engine Information'].str.extract(r'(\d+) ?hp')
Your approach should work with this:
def match_horsepower(s):
    m = re.search(r'(\d+) ?hp', s)
    return int(m[1]) if m else None
cars['HP'] = cars['Engine Information'].apply(match_horsepower)
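The first two problems in the list can be reproduced without pandas. The sketch below uses one of the sample strings from the question; the variable names are illustrative only.

```python
import re

s = 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs'

# Doubled backslashes in a raw string: the regex now looks for a literal
# backslash followed by the letter 'd', which never occurs in s.
double_escaped = re.search(r'\\d+(?=\\shp|hp)', s)

# re.match only tries position 0, where s starts with 'D', not a digit.
anchored = re.match(r'\d+(?=\s*hp)', s)

# re.search scans the whole string and finds the digits right before 'hp'.
found = re.search(r'\d+(?=\s*hp)', s)

print(double_escaped, anchored, found.group(0))
```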
This will get the numeric value just before hp, with or without (single or multiple) spaces.
r'\d+(?=\s+hp|hp)'
You can verify the regex here: https://regex101.com/r/pXySxm/1

Groups in regex python

Below is a piece of code I have been working on that is not giving the desired result. I would like to use only groups to split the elements into group 1 and group 2. For the last two elements I would like to match only 0567 and not 567 for group 2. I get the desired result for '+1 234 567' but not for '0567' or '567'. Please help with this.
import re
regex_str = r"^(?:\+1)?\s?([123456789]\d{2})\s?([123456789]\d{2})"
PATTERN = re.compile(regex_str)
num = ['+1 234 567', '0567', '567']
for i in num:
    m = PATTERN.match(i)
    if m is not None:
        print(i, "and", m.group(1), m.group(2))
    else:
        print(i, "has no match")
output:
+1 234 567 and 234 567
0567 has no match
567 has no match
Use a ? after group 1 to make it optional.
regex_str = r"^(?:\+1)?\s?([123456789]\d{2})?\s?([123456789]\d{2})"
Thanks for the suggestion; using ? makes group 1 optional. In addition, I had to use ?(1) to make sure group 1 had matched correctly before proceeding to group 2, which gave me the result I wanted. Thanks for the help.
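For reference, here is a small sketch of how the pattern with the optional group 1 behaves on the three inputs from the question. It does not include the ?(1) conditional mentioned in the follow-up, since that final pattern is not shown; the helper name split_groups is hypothetical.

```python
import re

# Pattern from the answer, written as a raw string.
PATTERN = re.compile(r"^(?:\+1)?\s?([123456789]\d{2})?\s?([123456789]\d{2})")

def split_groups(text):
    """Return (group 1, group 2), or None when the pattern does not match."""
    m = PATTERN.match(text)
    return (m.group(1), m.group(2)) if m else None

results = {s: split_groups(s) for s in ['+1 234 567', '0567', '567']}
for text, groups in results.items():
    print(repr(text), '->', groups)
```

When an optional group does not take part in the match, m.group(n) returns None for it, which is why '567' ends up in group 2 with group 1 empty. '0567' still fails, because nothing in the pattern accepts a leading 0.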

Python regex for UK number

Below are the UK phone numbers I need to fetch from a text file:
07791523634
07910221698
But it only prints 0779152363 and 0791022169, skipping the 11th character.
It also produces unnecessary values like ('')
Ex : '', '07800 854536'
Below is the regex I've used:
phnsrch = re.compile(r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{5}|\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|/^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$/|')
I need help fetching the complete 11-digit numbers without any unnecessary symbols.
Finally figured out the solution for matching the UK numbers below:
07540858798
0113 2644489
02074 735 217
07512 850433
01942 896007
01915222200
01582 492734
07548 021 475
020 8563 7296
07791523634
re.compile(r'\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|(?:\d{4}\)?[\s-]?\d{3}[\s-]?\d{4})')
Thanks to those who helped me with this issue.
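Two properties of alternation probably explain the symptoms in the question: a trailing | adds an empty alternative, and alternatives are tried left to right, so a 10-digit branch listed first wins over an 11-digit one. The sketch below uses heavily simplified patterns (no separators) and a made-up sample string to show both effects.

```python
import re

text = "call 07791523634 today"

# A trailing '|' adds an empty alternative, so the pattern also matches
# the empty string wherever no number starts.
with_empty_alt = re.findall(r'\d{11}|', text)

# Alternatives are tried left to right: the 10-digit branch matches first
# and the 11th digit is left behind.
short_first = re.findall(r'\d{3}\d{3}\d{4}|\d{3}\d{4}\d{4}', text)

# Putting the longer branch first and dropping the trailing '|' avoids both.
long_first = re.findall(r'\d{3}\d{4}\d{4}|\d{3}\d{3}\d{4}', text)

print(with_empty_alt)
print(short_first)
print(long_first)
```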
I think your regex is too long and could be simpler; try this regex instead:
^(07\d{8,12}|447\d{7,11})$

Making a decryption program in python

So a little while ago I asked for some help with an encryption program,
And you guys were amazing and came up with the solution.
So I come to you again in search of help for the equivalent decryption program.
The code I have got so far is like this:
whinger = 0
bewds = raw_input ('Please enter the encrypted message: ')
bewds = bewds.replace(' ', ', ')
warble = [bewds]
print warble
wetler = len(warble)
warble.reverse();
while whinger < wetler:
    print chr(warble[whinger]),
    whinger += 1
But when I input
101 103 97 115 115 101 109
it comes up with the error that the input is not an integer.
What I need is when I enter the numbers it turns them into a list of integers.
But I don't want to have to input all the numbers separately.
Thanks in advance for your help :P
To convert input string into a list of integers:
numbers = [int(s) for s in "101 103 97 115 115 101 109".split()]
Here's almost the simplest way I can think of to do it:
s = '101 103 97 115 115 101 109'
numbers = []
for number_str in s.replace(',', ' ').split():
    numbers.append(int(number_str))
It will allow the numbers to be separated with commas and/or one or more space characters. If you only want to allow spaces, leave the ".replace(',', ' ')" out.
Your problem is that raw_input returns a string. So you have two options.
1. Use the regular expression library re, e.g.:
import re
bewds = raw_input ('Please enter the encrypted message: ')
some_list = []
for find in re.finditer(r"\d+", bewds):
    some_list.append(int(find.group(0)))
2. Or you can use the split method, as described in the most voted answer to this question: sscanf in Python
You could also use map
numbers = map(int, '101 103 97 115 115 101 109'.split())
This returns a list in Python 2, but a map object in Python 3, which you might want to convert into a list.
numbers = list(map(int, '101 103 97 115 115 101 109'.split()))
This does exactly the same as J. F. Sebastian's answer.
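Putting the pieces together, a complete decryption step might look like the sketch below. It is written for Python 3, the function name decrypt is hypothetical, and the reversal is assumed from the warble.reverse() call in the question's code.

```python
def decrypt(message):
    """Decode space-separated character codes, reading them in reverse order."""
    codes = [int(token) for token in message.split()]
    codes.reverse()  # the question's program reverses the list before printing
    return ''.join(chr(code) for code in codes)

decoded = decrypt("101 103 97 115 115 101 109")
print(decoded)
```

In an interactive program, the argument would come from input() in Python 3 (raw_input() in Python 2) instead of a literal string.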
