Extract substrings separately from a string using python regex

I am trying to write a regular expression that returns the part of a string that comes after a given substring. For example, I want everything, spaces included, that follows "15/08/2017".
a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
Is there a way to get 'AFFIDAVIT OF' and 'CASH & MTGE' as separate strings?
Here is what I have pieced together so far:
doc = (a.split('15/08/2017', 1)[1]).strip()
It returns everything as a single string:
'AFFIDAVIT OF CASH & MTGE'

This is not a regex-based solution, but it does the trick.
a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
doc = (a.split('15/08/2017', 1)[1]).strip()
# split on two spaces instead of one to get the desired result
# (this relies on the columns in the original text being separated by several spaces)
print(doc.split("  ")[0].strip()) # outputs AFFIDAVIT OF
print(doc.split("  ")[-1].strip()) # outputs CASH & MTGE
Hope it helps.
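If the columns on that line are separated by runs of two or more spaces (an assumption, since the whitespace in the pasted text may have been collapsed), the same idea can be sketched with re.split. The sample line and the variable names below are made up for illustration.

```python
import re

# Hypothetical last line of the title search; the columns are assumed
# to be separated by at least two spaces.
line = "172 211 342    15/08/2017  AFFIDAVIT OF      CASH & MTGE"

# Keep only what follows the date, then split on runs of 2+ whitespace chars.
after_date = line.split("15/08/2017", 1)[1].strip()
doc_type, consideration = re.split(r"\s{2,}", after_date)

print(doc_type)       # AFFIDAVIT OF
print(consideration)  # CASH & MTGE
```

Unlike split("  "), this does not leave empty strings in the result when a gap is wider than two spaces.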

A re-based code snippet:
import re
foo = '''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
# raw string, so the backslashes reach the regex engine unchanged
pattern = r'.*\d{2}/\d{2}/\d{4}\s+(\w+\s+\w+)\s+(\w+\s+.*\s+\w+)'
result = re.findall(pattern, foo, re.MULTILINE)
print("1st match:", result[0][0])
print("2nd match:", result[0][1])
Output
1st match: AFFIDAVIT OF
2nd match: CASH & MTGE

We can try using re.findall with the following pattern:
PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)
Searching in multiline and DOTALL mode, the above pattern will match everything from PHASED OF up to, but not including, CONDOMINIUM PLAN.
input = "182 246 612 01/10/2018 PHASED OF CASH & MTGE\n CONDOMINIUM PLAN"
result = re.findall(r'PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)', input, re.DOTALL|re.MULTILINE)
output = result[0][0].strip()
print(output)
CASH & MTGE
Note that I also strip off whitespace from the match. We might be able to modify the regex pattern to do this, but in a general solution, maybe you want to keep some of the whitespace, in certain cases.
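As a hedged alternative to the tempered pattern above, a lazy quantifier gives the same result on this input and can also trim the surrounding whitespace inside the regex. This is only a sketch; the variable names are illustrative.

```python
import re

text = "182 246 612 01/10/2018 PHASED OF CASH & MTGE\n CONDOMINIUM PLAN"

# (.*?) is lazy, so it stops at the first CONDOMINIUM PLAN;
# re.DOTALL lets "." cross the line break between the two parts.
m = re.search(r"PHASED OF\s+(.*?)\s*CONDOMINIUM PLAN", text, re.DOTALL)
value = m.group(1) if m else None

print(value)  # CASH & MTGE
```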

Why regular expressions?
It looks like you know the exact delimiting string, so just str.split() on it and take the first part:
In [1]: a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
In [2]: a.split("15/08/2017", 1)[0]
Out[2]: '172 211 342 '
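A closely related option is str.partition, which returns the text before the delimiter, the delimiter itself, and the text after it in one call. A minimal sketch with the same sample line:

```python
line = "172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE"

# partition() always returns a 3-tuple; sep is "" when the delimiter is absent.
before, sep, after = line.partition("15/08/2017")

print(repr(before.strip()))  # '172 211 342'
print(repr(after.strip()))   # 'TRANSFER OF LAND $610,000 CASH & MTGE'
```

Checking whether sep is empty is a cheap way to detect a line that does not contain the date at all.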

I would avoid matching the whole line with a single regex here, because the only meaningful separator between the logical terms appears to be two or more spaces, and individual terms, including the one you want, may themselves contain single spaces. Instead, I recommend a regex split on the input using \s{2,} as the pattern. This yields a list containing all the terms. We can then walk the list once and, when we find the term we are looking ahead for, return the previous term in the list.
import re
# the terms are assumed to be separated by two or more spaces
a = "172 211 342  15/08/2017  TRANSFER OF LAND  $610,000  CASH & MTGE"
parts = re.compile(r"\s{2,}").split(a)
print(parts)
for i in range(1, len(parts)):
    if parts[i] == "15/08/2017":
        print(parts[i-1])
['172 211 342', '15/08/2017', 'TRANSFER OF LAND', '$610,000', 'CASH & MTGE']
172 211 342
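The walk over the list can also be wrapped in a small helper. The function name term_before is made up for this sketch, and the sample row assumes that terms are separated by two or more spaces.

```python
import re

def term_before(line, marker):
    """Return the term that precedes `marker`, or None if there is none."""
    parts = re.split(r"\s{2,}", line.strip())
    if marker in parts:
        i = parts.index(marker)
        if i > 0:
            return parts[i - 1]
    return None

# Hypothetical row with terms separated by two spaces.
row = "172 211 342  15/08/2017  TRANSFER OF LAND  $610,000  CASH & MTGE"

print(term_before(row, "15/08/2017"))  # 172 211 342
```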

Use a positive lookbehind assertion:
m=re.search('(?<=15/08/2017).*', a)
m.group(0)
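The lookbehind above hard-codes one date. As a small sketch (the sample line is shortened, and the dd/mm/yyyy shape is an assumption about the data), a capturing group after a generic date pattern returns the rest of the line for any date:

```python
import re

line = "172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE"

# \d{2}/\d{2}/\d{4} stands for any dd/mm/yyyy date; group 1 is what follows it.
m = re.search(r"\d{2}/\d{2}/\d{4}\s+(.*)", line)
rest = m.group(1) if m else None

print(rest)  # AFFIDAVIT OF CASH & MTGE
```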

You have to return the right group:
re.match("(.*?)15/08/2017",a).group(1)

You need to use group(1):
import re
re.match("(.*?)15/08/2017",a).group(1)
Output
'172 211 342 '

Building on your expression, this is what I believe you need:
import re
a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
re.match(r"(.*?)(\w+/)",a).group(1)
Output:
'172 211 342 '

You can do this by using group(1):
re.match("(.*?)15/08/2017",a).group(1)
UPDATE
For the updated string, use re.search instead of re.match, because re.match only matches at the beginning of the string:
re.search("(.*?)15/08/2017",a).group(1)
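The difference between the two calls is easy to see on a shortened, two-line stand-in for the updated string (the text below is made up for illustration):

```python
import re

text = "LEGAL DESCRIPTION\n172 211 342 15/08/2017 TRANSFER OF LAND"
pattern = r"(.*?)15/08/2017"

# re.match is anchored at position 0 and "." does not cross "\n",
# so the date on the second line is never reached.
matched = re.match(pattern, text)
print(matched)  # None

# re.search retries at later positions and succeeds on the second line.
found = re.search(pattern, text)
print(repr(found.group(1)))  # '172 211 342 '
```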

Your problem is that your string is formatted the way it is.
The line you are looking for is
182 246 612 01/10/2018 PHASED OF CASH & MTGE
And then you are looking for whatever comes after 'PHASED OF' and some spaces.
You want to search for
(?<=PHASED OF)\s*(?P<your_text>.*?)\n
in your string. This will return a match object containing the value you are looking for in the named group your_text.
m = re.search(r'(?<=PHASED OF)\s*(?P<your_text>.*?)\n', a)
your_desired_text = m.group('your_text')
Also, there are many good online regex testers for experimenting with your regexes.
Once the regex works there, just copy and paste it into Python.
I use this one: https://regex101.com/

Related

Regex to unify a format of phone numbers in Python

I'm trying to write a regex that matches a phone number like +34 (prefix), a single space, followed by 9 digits that may or may not be separated by spaces.
+34 886 24 68 98
+34 980 202 157
I need a regex that works with these two example cases.
I tried ^(\+34)\s([ *]|[0-9]{9}) but it does not work.
Ultimately I'd like to rewrite every phone number as the +34 prefix, a single space, and then the 9 digits, whichever of these cases is given. I'm using the re.sub() function for that, but I'm not sure how to handle the first two cases.
+34 886 24 68 98 -> ?
+34 980 202 157 -> ?
+34 846082423 -> `^(\+34)\s(\d{9})$`
+34920459596 -> `^(\+34)(\d{9})$`
import re
from faker import Faker
from faker.providers import BaseProvider
#fake = Faker("es_ES")
class CustomProvider(BaseProvider):
    def phone(self):
        #phone = fake.phone_number()
        phone = "+34812345678"
        return re.sub(r'^(\+34)(\d{9})$', r'\1 \2', phone)

You can try:
^\+34\s*(?:\d\s*){9}$
^ - beginning of the string
\+34\s* - match +34 followed by any number of spaces
(?:\d\s*){9} - match number followed by any number of spaces 9 times
$ - end of string
Regex demo.
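To go from matching to the unified format asked for in the question, one option is to capture the nine digits with that pattern and strip the inner spaces afterwards. The helper name normalize and the constant PHONE_RE are assumptions made for this sketch.

```python
import re

# Same idea as the pattern above, with a capturing group around the 9 digits.
PHONE_RE = re.compile(r"^\+34\s*((?:\d\s*){9})$")

def normalize(phone):
    """Return '+34 XXXXXXXXX', or None if `phone` does not fit the pattern."""
    m = PHONE_RE.match(phone.strip())
    if m is None:
        return None
    digits = re.sub(r"\s+", "", m.group(1))  # drop spaces between the digits
    return f"+34 {digits}"

for raw in ["+34 886 24 68 98", "+34 980 202 157", "+34 846082423", "+34920459596"]:
    print(raw, "->", normalize(raw))
```

Returning None for inputs that do not fit keeps malformed numbers from being silently reformatted.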

Here's a simple approach: use regex to get the plus sign and all the numbers into an array (one char per element), then use other list and string manipulation operations to format it the way you like.
import re
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
pattern = r'[+\d]'
m1 = re.findall(pattern, p1)
m2 = re.findall(pattern, p2)
m1_str = f"{''.join(m1[:3])} {''.join(m1[3:])}"
m2_str = f"{''.join(m2[:3])} {''.join(m2[3:])}"
print(m1_str) # +34 886246898
print(m2_str) # +34 980202157
Or removing spaces using string replacement instead of regex:
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
p1_compact = p1.replace(' ', '')
p2_compact = p2.replace(' ', '')
p1_str = f"{p1_compact[:3]} {p1_compact[3:]}"
p2_str = f"{p2_compact[:3]} {p2_compact[3:]}"
print(p1_str) # +34 886246898
print(p2_str) # +34 980202157

I would capture the numbers like this: r"(\+34(?:\s?\d){9})".
That allows you to search for numbers with an optional whitespace character before any of the digits. The non-capturing group (?:...) lets \s?\d repeat without each digit being listed as a group of its own.
import re
nums = """
Number 1: +34 886 24 68 98
Number 2: +34 980 202 157
Number 3: +34812345678
"""
number_re = re.compile(r"(\+34(?:\s?\d){9})")
for match in number_re.findall(nums):
    print(match)
+34 886 24 68 98
+34 980 202 157
+34812345678

Regular expression to find a sequence of numbers before multiple patterns, into a new column (Python, Pandas)

Here is my sample data:
import pandas as pd
import re
cars = pd.DataFrame({'Engine Information': {0: 'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
1: 'Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs',
2: 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
3: 'MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs',
4: 'Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV',
5: 'GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs'},
'HP': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None}})
Here is my desired output:
I have created a new column called 'HP' where I want to extract the horsepower figure from the original column ('Engine Information')
Here is the code I have tried to do this:
cars['HP'] = cars['Engine Information'].apply(lambda x: re.match(r'\\d+(?=\\shp|hp)', str(x)))
The idea is to match the pattern "a sequence of digits that comes before either 'hp' or ' hp'". This is because some of the cells have no space between the number and 'hp', as shown in my example.
I'm fairly sure the regex is correct, because I have successfully done a similar process in R. However, I have tried functions such as str.extract, re.findall, re.search and re.match, and they either raise errors or return None values (as shown in the sample). So I am a bit lost here.
Thanks!

You can use str.extract:
cars['HP'] = cars['Engine Information'].str.extract(r'(\d+)\s*hp\b', flags=re.I)
Details
(\d+)\s*hp\b - matches and captures into Group 1 one or more digits, then just matches 0 or more whitespaces (\s*) and hp (in a case insensitive way due to flags=re.I) as a whole word (since \b marks a word boundary)
str.extract only returns the captured value if there is a capturing group in the pattern, so the hp and whitespaces are not part of the result.
Python demo results:
>>> cars
Engine Information HP
0 Honda 2.4L 4 cylinder 190 hp 162 ft-lbs 190
1 Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs 420
2 Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs 390
3 MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs 118
4 Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV 360
5 GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs 352

There are several problems:
re.match only looks at the beginning of your string; use re.search if your pattern may appear anywhere.
Don't double the backslashes if you use a raw string, i.e. write either '\\d hp' or r'\d hp'. Raw strings exist exactly so that you can avoid the extra escaping.
Return the matched group. You search but never extract the group that was found. re.search(rex, string) gives you a match object, from which you can extract all groups, e.g. re.search(rex, string)[0].
You have to wrap the access in a separate function, because you have to check whether there was a match before accessing the group. If you don't, an exception may stop the apply call right in the middle.
apply is slow; use pandas vectorized functions like extract: cars['Engine Information'].str.extract(r'(\d+) ?hp')
Your approach should work with this:
def match_horsepower(s):
    m = re.search(r'(\d+) ?hp', s)
    return int(m[1]) if m else None
cars['HP'] = cars['Engine Information'].apply(match_horsepower)
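The first two problems can be reproduced without pandas. This sketch runs the original pattern and two corrected calls against one of the sample strings:

```python
import re

info = "Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs"

# Doubled backslashes inside a raw string look for a literal backslash,
# which the text does not contain.
double_escaped = re.search(r"\\d+(?=\\shp|hp)", info)
print(double_escaped)  # None

# re.match only tries position 0, where "Dodge" is not a digit.
anchored = re.match(r"\d+(?=\s?hp)", info)
print(anchored)  # None

# re.search scans the whole string and finds the digits before "hp".
found = re.search(r"\d+(?=\s?hp)", info)
print(found.group(0))  # 390
```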

This will get the numeric value just before hp, with or without (single or multiple) spaces.
r'\d+(?=\s+hp|hp)'
You can verify Regex Here: https://regex101.com/r/pXySxm/1
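As a quick check outside pandas, a lookahead of this kind can be run against the six sample strings from the question. The sketch uses the slightly shorter \s* form, and it adds re.IGNORECASE and a trailing \b as extra safeguards that are not part of the pattern above.

```python
import re

samples = [
    "Honda 2.4L 4 cylinder 190 hp 162 ft-lbs",
    "Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs",
    "Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs",
    "MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs",
    "Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV",
    "GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs",
]

# The lookahead checks for "hp" without consuming it, so only digits are returned.
hp_re = re.compile(r"\d+(?=\s*hp\b)", re.IGNORECASE)

values = [int(hp_re.search(s).group()) for s in samples]
print(values)  # [190, 420, 390, 118, 360, 352]
```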

How to extract first floating numbers appearing after a word?

I'm trying to build an application for a text extraction use case, but I have not been able to extract the exact price.
I have text like this:
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
I want to extract the price that appears after the word total using regex, but so far I have only been able to extract all floating-point numbers. Note that words such as sub total sometimes appear as well, but I only need the price that appears after the word total. Other words may also occur between total and the price. So the regex should match the word total and extract the first floating-point number that follows it.
Any help is appreciated.
This is what I've tried:
re.findall(r"\d+\.\d+", string1) # this returns all floating numbers.

You can try
(?<=\nTotal):?\D+([\d.]+)
Demo

You could try this; it should work for the examples and the other restrictions you mentioned:
import re
result = re.search(r'Total\n\$(\d+\.\d+)', string1)
result.group(1) # 191.44
result = re.search(r'Total:\n.+\n(\d+\.\d+)', string2)
result.group(1) # 54.50
EDIT: If you want only one expression for both, you could try
# string is either string1 or string2
result = re.search(r'\nTotal:?(\n\D+)*\n\$?(\d+\.\d+)', string)
result.group(2)

You could use a negative lookbehind to prevent sub from appearing before total, word boundaries to prevent the words from being part of a larger word, and a capturing group to capture the price.
(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))
In parts:
(?<!\bsub ) Negative lookbehind, assert what is on the left is not the word sub and a space
\btotal\b Match total between word boundaries to prevent it being part of a larger word
\D* Match 0+ times any char that is not a digit
( Capture group 1
\d+(?:\.\d+) Match 1+ digits followed by a decimal part
) Close group
Regex demo | Python demo
For example
import re
regex = r"(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))"
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
print(re.findall(regex, string1, re.IGNORECASE))
print(re.findall(regex, string2, re.IGNORECASE))
Output
['191.44']
['54.50']
If what precedes the price should be a dollar sign or the text CHF, you might use an alternation (?:\$|CHF)\s* that matches either of the values, followed by 0+ whitespace chars:
(?<!\bsub )\btotal\b\D*(?:\$|CHF)\s*(\d+(?:\.\d+))
Regex demo
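A short check of the currency variant, using shortened excerpts of the two receipts (the excerpts and variable names are chosen for this sketch only):

```python
import re

regex = re.compile(
    r"(?<!\bsub )\btotal\b\D*(?:\$|CHF)\s*(\d+(?:\.\d+))",
    re.IGNORECASE,
)

# Shortened excerpts of string1 and string2.
receipt_usd = "Sub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n"
receipt_chf = "18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n"

usd = regex.findall(receipt_usd)
chf = regex.findall(receipt_chf)
print(usd)  # ['191.44']
print(chf)  # ['54.50']
```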

Something like this might do the trick:
(?<!sub )total.*?(\d+\.\d+)
Make sure to ignore case, and enable DOTALL so that . can cross the line breaks between total and the price.
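Because the price sits on a later line than the word total, the flags matter for a pattern of this shape. A minimal sketch on a shortened excerpt of string1, with the dot in the number escaped:

```python
import re

receipt = "Sub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n"
pattern = r"(?<!sub )total.*?(\d+\.\d+)"

# Without re.DOTALL, "." cannot cross the newline after "Total".
no_dotall = re.findall(pattern, receipt, re.IGNORECASE)
print(no_dotall)  # []

# With re.DOTALL, the lazy .*? reaches the first number after "Total".
with_dotall = re.findall(pattern, receipt, re.IGNORECASE | re.DOTALL)
print(with_dotall)  # ['191.44']
```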

Python regex for UK number

These are the UK phone numbers I need to fetch from a text file:
07791523634
07910221698
But my code only prints 0779152363 and 0791022169, skipping the 11th character.
It also produces unnecessary empty values ('').
Ex : '', '07800 854536'
Below is the regex I've used:
phnsrch = re.compile(r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{5}|\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|/^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$/|')
I need help fetching the complete 11-digit numbers without any unnecessary symbols.

Finally figured out the solution for matching the UK numbers below:
07540858798
0113 2644489
02074 735 217
07512 850433
01942 896007
01915222200
01582 492734
07548 021 475
020 8563 7296
07791523634
re.compile(r'\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|(?:\d{4}\)?[\s-]?\d{3}[\s-]?\d{4})')
Thanks to those who helped me with this issue.
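In the question's pattern, the first alternative matches only ten digits, and the trailing | adds an empty alternative, which is a likely source of the truncated numbers and the '' results. A different, looser approach is sketched below: find anything shaped like a leading 0 plus ten more digits, then strip the separators. The pattern is an assumption made for illustration and is not a full UK numbering-plan validator.

```python
import re

text = """07540858798
0113 2644489
020 8563 7296
07791523634"""

# A leading 0 and ten more digits, each optionally preceded by one space or hyphen.
candidate_re = re.compile(r"\b0(?:[ -]?\d){10}\b")

# Remove the separators so every number comes out as 11 plain digits.
numbers = [re.sub(r"[ -]", "", m.group()) for m in candidate_re.finditer(text)]
print(numbers)
# ['07540858798', '01132644489', '02085637296', '07791523634']
```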

I think your regex is too long and could be much simpler. Try this regex instead:
^(07\d{8,12}|447\d{7,11})$
