Regex to unify a format of phone numbers in Python - python

I'm trying to write a regex that matches a phone number made of +34 (the prefix), a single space, and then 9 digits that may or may not be separated by spaces.
+34 886 24 68 98
+34 980 202 157
I would need a regex to work with these two example cases.
I tried ^(\+34)\s([ *]|[0-9]{9}) but it doesn't work.
Ultimately I'd like to normalize every number to +34 (the prefix), a single space, then the 9 digits, no matter which of the cases below is given. I'm using re.sub() for that, but I'm not sure how to handle the first two cases.
+34 886 24 68 98 -> ?
+34 980 202 157 -> ?
+34 846082423 -> `^(\+34)\s(\d{9})$`
+34920459596 -> `^(\+34)(\d{9})$`
import re
from faker import Faker
from faker.providers import BaseProvider
#fake = Faker("es_ES")
class CustomProvider(BaseProvider):
    def phone(self):
        #phone = fake.phone_number()
        phone = "+34812345678"
        return re.sub(r'^(\+34)(\d{9})$', r'\1 \2', phone)

You can try:
^\+34\s*(?:\d\s*){9}$
^ - beginning of the string
\+34\s* - match +34 followed by any number of spaces
(?:\d\s*){9} - match number followed by any number of spaces 9 times
$ - end of string
Regex demo.
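To go from matching to the unified format the question asks for, one option is to capture the nine-digit part and strip the whitespace out of it. This is only a sketch: the capturing group around (?:\d\s*){9} and the helper name normalize are my additions, not part of the pattern above.

```python
import re

# Same pattern as above, with a capturing group added around the
# nine-digit part (an assumption made for this sketch).
PHONE_RE = re.compile(r'^\+34\s*((?:\d\s*){9})$')

def normalize(phone):
    """Return '+34 ' followed by the 9 digits, or None if the input doesn't match."""
    m = PHONE_RE.match(phone)
    if m is None:
        return None
    digits = re.sub(r'\s+', '', m.group(1))  # drop spaces between the digits
    return '+34 ' + digits

for raw in ["+34 886 24 68 98", "+34 980 202 157", "+34 846082423", "+34920459596"]:
    print(raw, '->', normalize(raw))
```

Returning None for non-matching input is a design choice for the sketch; raising an exception or returning the input unchanged would work just as well.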

Here's a simple approach: use regex to get the plus sign and all the numbers into an array (one char per element), then use other list and string manipulation operations to format it the way you like.
import re
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
pattern = r'[+\d]'
m1 = re.findall(pattern, p1)
m2 = re.findall(pattern, p2)
m1_str = f"{''.join(m1[:3])} {''.join(m1[3:])}"
m2_str = f"{''.join(m2[:3])} {''.join(m2[3:])}"
print(m1_str) # +34 886246898
print(m2_str) # +34 980202157
Or removing spaces using string replacement instead of regex:
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
p1_compact = p1.replace(' ', '')
p2_compact = p2.replace(' ', '')
p1_str = f"{p1_compact[:3]} {p1_compact[3:]}"
p2_str = f"{p2_compact[:3]} {p2_compact[3:]}"
print(p1_str) # +34 886246898
print(p2_str) # +34 980202157

I would capture the numbers like this: r"(\+34(?:\s?\d){9})".
That lets you search for numbers while allowing optional whitespace before any of the digits. The non-capturing group (?:...) lets you repeat \s?\d without each digit being captured as a group of its own.
import re
nums = """
Number 1: +34 886 24 68 98
Number 2: +34 980 202 157
Number 3: +34812345678
"""
number_re = re.compile(r"(\+34(?:\s?\d){9})")
for match in number_re.findall(nums):
    print(match)
+34 886 24 68 98
+34 980 202 157
+34812345678
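If the goal is to rewrite the numbers in place rather than just list them, the same pattern can probably be passed to sub() with a replacement function. This is a sketch under my own assumptions: the helper name compact and the sample text are made up for illustration.

```python
import re

number_re = re.compile(r"\+34(?:\s?\d){9}")

def compact(match):
    # Remove all whitespace from the matched number, then put a single
    # space back after the three-character "+34" prefix.
    digits = re.sub(r"\s", "", match.group(0))
    return digits[:3] + " " + digits[3:]

text = "Number 1: +34 886 24 68 98, Number 3: +34812345678"
result = number_re.sub(compact, text)
print(result)
```

Using a function as the replacement keeps the surrounding text untouched and only reformats the parts that matched.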

Related

Regular expression to find a sequence of numbers before multiple patterns, into a new column (Python, Pandas)

Here is my sample data:
import pandas as pd
import re
cars = pd.DataFrame({'Engine Information': {0: 'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
                                            1: 'Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs',
                                            2: 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
                                            3: 'MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs',
                                            4: 'Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV',
                                            5: 'GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs'},
                     'HP': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None}})
Here is my desired output:
I have created a new column called 'HP' where I want to extract the horsepower figure from the original column ('Engine Information')
Here is the code I have tried to do this:
cars['HP'] = cars['Engine Information'].apply(lambda x: re.match(r'\\d+(?=\\shp|hp)', str(x)))
The idea is to match "a sequence of digits that comes before either 'hp' or ' hp'", because some of the cells have no space between the number and 'hp', as shown in my example.
I'm sure the regex is correct, because I have successfully done a similar process in R. However, I have tried str.extract, re.findall, re.search and re.match, and they either raise errors or return None (as shown in the sample). So I am a bit lost here.
Thanks!
You can use str.extract:
cars['HP'] = cars['Engine Information'].str.extract(r'(\d+)\s*hp\b', flags=re.I)
Details
(\d+)\s*hp\b - matches and captures into Group 1 one or more digits, then just matches 0 or more whitespaces (\s*) and hp (in a case insensitive way due to flags=re.I) as a whole word (since \b marks a word boundary)
str.extract only returns the captured value if there is a capturing group in the pattern, so the hp and whitespaces are not part of the result.
Python demo results:
>>> cars
                               Engine Information   HP
0         Honda 2.4L 4 cylinder 190 hp 162 ft-lbs  190
1  Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs  420
2          Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs  390
3          MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs  118
4       Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV  360
5           GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs  352
There are several problems:
re.match only looks at the beginning of the string; use re.search if the pattern may appear anywhere.
Don't double the backslashes in a raw string: write either '\\d hp' or r'\d hp'. Raw strings exist precisely so that you can avoid that escaping.
Return the matched group. You only search but never return what was found. re.search(rex, string) gives you a match object, from which you can extract the groups, e.g. re.search(rex, string)[0].
Wrap the access in a separate function, because you have to check whether there was a match before accessing the group. Otherwise an exception may stop the apply process right in the middle.
apply is slow; use pandas vectorized functions like extract: cars['Engine Information'].str.extract(r'(\d+) ?hp')
Your approach should work with this:
def match_horsepower(s):
    m = re.search(r'(\d+) ?hp', s)
    return int(m[1]) if m else None
cars['HP'] = cars['Engine Information'].apply(match_horsepower)
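To see the first two problems in isolation, here is a small sketch that uses only the re module. The sample string is one of the rows from the question; the variable names are mine.

```python
import re

s = 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs'

# re.match anchors at the start of the string, and the string starts
# with "Dodge", so there is nothing to match.
at_start = re.match(r'\d+(?=\s?hp)', s)

# re.search scans the whole string and finds the digits before "hp".
anywhere = re.search(r'\d+(?=\s?hp)', s)

# Doubling the backslashes inside a raw string makes the regex look for
# a literal backslash followed by "d", which never occurs in the data.
double_escaped = re.search(r'\\d+(?=\\shp|hp)', s)

print(at_start, anywhere.group(0), double_escaped)
```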
This will get the numeric value just before hp, with or without (single or multiple) spaces.
r'\d+(?=\s+hp|hp)'
You can verify Regex Here: https://regex101.com/r/pXySxm/1

removing words from a list from pandas column - python 2.7

I have a text file which contains some strings that I want to remove from my data frame. The rows of the data frame contain the texts that are present in the text file.
here is the text file - https://drive.google.com/open?id=1GApPKvA82tx4CDtlOTqe99zKXS3AHiuD
here is the link; Data = https://drive.google.com/open?id=1HJbWTUMfiBV54EEtgSXTcsQLzQT1rFgz
I am using the following code -
import nltk
from nltk.tokenize import word_tokenize
file = open("D://Users/Shivam/Desktop/rahulB/fliter.txt")
result = file.read()
words = word_tokenize(result)
I loaded the text file and converted it into words/tokens.
This is my dataframe.
text
0 What Fresh Hell Is This? January 31, 2018 ...A...
1 What Fresh Hell Is This? February 27, 2018 My ...
2 What Fresh Hell Is This? March 31, 2018 Trump ...
3 What Fresh Hell Is This? April 29, 2018 Michel...
4 Join Email List Contribute Join AMERICAblog Ac...
As you can see, these texts are present in all the rows, such as "What Fresh Hell Is This?", "Join Email List Contribute Join AMERICAblog Ac", "Sign in Daily Roundup MS Legislature Elected O", etc.
I used this for loop:
for word in words:
    df['text'].replace(word, ' ')
This is my error:
error Traceback (most recent call last)
<ipython-input-168-6e0b8109b76a> in <module>()
----> 1 df['text'] = df['text'].str.replace("|".join(words), " ")
D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in replace(self, pat, repl, n, case, flags)
1577 def replace(self, pat, repl, n=-1, case=None, flags=0):
1578 result = str_replace(self._data, pat, repl, n=n, case=case,
-> 1579 flags=flags)
1580 return self._wrap_result(result)
1581
D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in str_replace(arr, pat, repl, n, case, flags)
422 if use_re:
423 n = n if n >= 0 else 0
--> 424 regex = re.compile(pat, flags=flags)
425 f = lambda x: regex.sub(repl=repl, string=x, count=n)
426 else:
D:\Users\Shivam\Anaconda2\lib\re.pyc in compile(pattern, flags)
192 def compile(pattern, flags=0):
193 "Compile a regular expression pattern, returning a pattern object."
--> 194 return _compile(pattern, flags)
195
196 def purge():
D:\Users\Shivam\Anaconda2\lib\re.pyc in _compile(*key)
249 p = sre_compile.compile(pattern, flags)
250 except error, v:
--> 251 raise error, v # invalid expression
252 if not bypass_cache:
253 if len(_cache) >= _MAXCACHE:
error: nothing to repeat
You can use str.replace
Ex:
df['text'] = df['text'].str.replace("|".join(words), " ")
You can modify your code in this way:
for word in words:
    df['text'] = df['text'].str.replace(word, ' ')
You may use
df['text'] = df['text'].str.replace(r"\s*(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])), " ")
The r"\s*(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])) line will perform these steps:
[re.escape(x) for x in words] - will escape all special chars in the words so that they can be used in a regex safely
"|".join([...]) - will create alternations that will be matched by the regex engine
r"\s*(?<!\w)(?:{})(?!\w)".format(....) - will create a regex like \s*(?<!\w)(?:word1|word2|wordn)(?!\w) that will match words as whole words from the list (\s* will also remove 0+ whitespaces before the words).

Unable to capture certain phone numbers with different pattern

What should be the appropriate regular expression to capture all the phone numbers listed below? I tried with one and it partially does the work. However, I would like to get them all. Thanks for any suggestion or help.
Here are the numbers along with my script I tried with:
import re
content='''
415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292
+1 416 555 9292
'''
for phone in re.findall(r'\+?1?\s?\(?\d*\)?[\s-]\d*[\s-]\d*',content):
    print(phone)
The result I'm getting is:
415
-555-1234
650-555-2345
555-3456
202
555 4567
4035555678
1 416 555
9292
+1 416 555 9292
I suggest making some parts of the regex obligatory (such as the digit patterns, by replacing * with +), or it might match meaningless parts of the text. Also, note that \s matches any whitespace, including line breaks, while you most probably want each match to stay on a single line.
You might try
\+?1? ?(?:\(?\d+\)?)?(?:[ -]?\d+){1,2}
See the regex demo
Details
\+? - an optional plus
1? - an optional 1
 ? - an optional space (a literal space followed by ?)
(?:\(?\d+\)?)? - an optional sequence of a (, then 1+ digits and then an optional )
(?:[ -]?\d+){1,2} - 1 or 2 occurrences of:
[ -]? - an optional space or -
\d+ - 1+ digits
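As a quick check, the pattern can be run against the sample from the question. This is a sketch; the variable names expected and matches are mine.

```python
import re

content = '''
415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292
+1 416 555 9292
'''

pattern = re.compile(r'\+?1? ?(?:\(?\d+\)?)?(?:[ -]?\d+){1,2}')

# One expected number per non-empty line of the sample.
expected = content.strip().splitlines()
matches = pattern.findall(content)

for phone in matches:
    print(phone)
```

Because the pattern uses a literal space instead of \s, a match cannot run across a line break.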
I think this regex will work in your case:
import re
content = '''
415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292
+1 416 555 9292
'''
for phone in re.findall(r'(([+]?\d\s\d?)?\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})', content):
    print(phone[0])

How can I match the whole regex not the subexpression

Say, I have the following regex to search a series of room number:
import re
re.findall(r'\b(\d)\d\1\b','101 102 103 201 202 203')
I want to search for the room number whose first and last digit are the same (101 and 202). The above code gives
['1','2']
which corresponds to the subexpression (\d). How can I make it return the whole room number, like 101 and 202?
import re
print([i for i, j in re.findall(r'\b((\d)\d\2)\b', '101 102 103 201 202 203')])
or
print([i[0] for i in re.findall(r'\b((\d)\d\2)\b', '101 102 103 201 202 203')])
You can use a list comprehension here. re.findall returns all the groups in the regex, so you need two groups: the outer one holds the whole room number and the inner one is used for the backreference. You only need the room numbers, so take just the first element of each 2-tuple.
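A different route, not the one used above, is re.finditer: each match object exposes the whole match through group(0), so the original pattern with a single group can stay as it is. A minimal sketch with my own variable names:

```python
import re

rooms = '101 102 103 201 202 203'

# group(0) is always the entire match, regardless of capturing groups.
whole = [m.group(0) for m in re.finditer(r'\b(\d)\d\1\b', rooms)]

# For comparison, the nested-group version returns 2-tuples.
pairs = re.findall(r'\b((\d)\d\2)\b', rooms)

print(whole)
print(pairs)
```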

Split String with Python Regexp

If I have a string like:
"|CLL23|STR. CALIFORNIA|CLL12|AV. PHILADELFIA 438|CLL10|AV. 234 DEPTO 34|"
I need to split the string into the following form:
CLL23|STR.CALIFORNIA
CLL12|AV. TEXAS 345
CLL10|AV. 234 DEPTO 24
I tried the following:
r=re.compile('(?<=[|])([\w]+)')
v_sal=r.findall(v_campo)
print v_sal
Result:
['CLL23', 'CLL12', 'CLL10']
How can I get the rest of the string this way in Python?
Let's define your string:
>>> s = "|CLL23|STR. CALIFORNIA|CLL12|AV. PHILADELFIA 438|CLL10|AV. 234 DEPTO 34|"
Now, let's print the formatted form:
>>> print('\n'.join('CLL' + word.rstrip('|') for word in s.split('|CLL') if word))
CLL23|STR. CALIFORNIA
CLL12|AV. PHILADELFIA 438
CLL10|AV. 234 DEPTO 34
The above divides on |CLL. This seems to work for your sample input.
Another simple solution would be to split() the string at every '|' and then print them in chunks:
s="|CLL23|STR. CALIFORNIA|CLL12|AV. PHILADELFIA 438|CLL10|AV. 234 DEPTO 34|"
s1 = [part for part in s.split('|') if part]  # split string and drop empty strings
for x, y in zip(s1[0::2], s1[1::2]):
    print(x + '|' + y)
Output:
>>>
CLL23|STR. CALIFORNIA
CLL12|AV. PHILADELFIA 438
CLL10|AV. 234 DEPTO 34
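Since the question asks for a regex, here is one possible regex-based sketch as well. It assumes the fields always come in code/address pairs and never contain a | themselves; the variable names are mine.

```python
import re

s = "|CLL23|STR. CALIFORNIA|CLL12|AV. PHILADELFIA 438|CLL10|AV. 234 DEPTO 34|"

# Each match is two consecutive non-empty fields separated by a single "|".
pairs = re.findall(r'([^|]+)\|([^|]+)', s)
lines = ['|'.join(pair) for pair in pairs]

print('\n'.join(lines))
```

Matches cannot overlap, so every field is used exactly once and the pairs come out in order.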
