In Python, I have a string like this. Input:
IBNR 13,123 1,234 ( 556 ) ( 2,355 ) 934
Required output:
Either remove the spaces between the brackets and the number:
IBNR 13,123 1,234 (556) (2,355) 934
or remove the brackets:
IBNR 13,123 1,234 556 2,355 934
I have tried this:
re.sub('(?<=\d)+ (?=\\))','',text1)
This handles the right-hand side; I need help with the left side.
You could use
import re
data = """IBNR 13,123 1,234 ( 556 ) ( 2,355 ) 934 """
def replacer(m):
    return f"({m.group(1).strip()})"
data = re.sub(r'\(([^()]+)\)', replacer, data)
print(data)
# IBNR 13,123 1,234 (556) (2,355) 934
Or remove the parentheses altogether:
data = re.sub(r'[()]+', '', data)
# IBNR 13,123 1,234 556 2,355 934
As @JvdV points out, you might be better off using
re.sub(r'\(\s*(\S+)\s*\)', r'\1', data)
Escape the brackets with this pattern:
(\w+\s+\d+,\d+\s+\d+,\d+\s+)\((\s+\d+\s+)\)(\s+)\((\s+\d+,\d+\s)\)(\s+\d+)
See the results, including substitutions:
https://regex101.com/r/ch6Jge/1
I rarely use the lookahead at all, but I think it does what you want.
re.sub(r'\(\s(\d+(?:\,\d+)*)\s\)', r'\1', text1)
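For comparison, here is a minimal sketch that produces both requested outputs from the sample string with a single pattern. The character class [\d,]+ is an assumption about what can appear inside the brackets (digits and thousands separators only), and the variable names are only illustrative.

```python
import re

text1 = "IBNR 13,123 1,234 ( 556 ) ( 2,355 ) 934"

# One pattern for both variants: optional padding inside the parentheses,
# with the number itself captured in group 1.
padded = r'\(\s*([\d,]+)\s*\)'

# Variant 1: keep the parentheses, drop the padding.
tight = re.sub(padded, r'(\1)', text1)

# Variant 2: drop the parentheses as well.
bare = re.sub(padded, r'\1', text1)

print(tight)
print(bare)
```

Only the replacement string differs between the two variants, so switching the output format later does not require touching the pattern.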
I'm trying a regex to match a phone like +34(prefix), single space, followed by 9 digits that may or may not be separated by spaces.
+34 886 24 68 98
+34 980 202 157
I would need a regex to work with these two example cases.
I tried ^(\+34)\s([ *]|[0-9]{9}), but it does not work.
Ultimately I'd like to match a phone number like +34 (the prefix), a single space, followed by 9 digits, whichever of these cases is given. I'm using the re.sub() function for that, but I'm not sure how.
+34 886 24 68 98 -> ?
+34 980 202 157 -> ?
+34 846082423 -> `^(\+34)\s(\d{9})$`
+34920459596 -> `^(\+34)(\d{9})$`
import re
from faker import Faker
from faker.providers import BaseProvider
#fake = Faker("es_ES")
class CustomProvider(BaseProvider):
    def phone(self):
        #phone = fake.phone_number()
        phone = "+34812345678"
        return re.sub(r'^(\+34)(\d{9})$', r'\1 \2', phone)
You can try:
^\+34\s*(?:\d\s*){9}$
^ - beginning of the string
\+34\s* - match +34 followed by any number of spaces
(?:\d\s*){9} - match number followed by any number of spaces 9 times
$ - end of string
Regex demo.
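As a sketch of how that pattern could be combined with re.sub() to produce the "+34, single space, 9 digits" form the question asks for: validate with the pattern first, then strip everything that is not a digit and re-insert the single space. The function name normalize_phone is hypothetical, and returning None for non-matching input is an assumption about the desired behaviour.

```python
import re

# The pattern from the answer above: +34, then 9 digits with optional spaces.
PHONE_RE = re.compile(r'^\+34\s*(?:\d\s*){9}$')

def normalize_phone(phone):
    """Return '+34 ' followed by the 9 digits, or None if the input does not match."""
    if not PHONE_RE.match(phone):
        return None
    digits = re.sub(r'\D', '', phone)      # removes '+' and spaces, keeps the 11 digits
    return f"+{digits[:2]} {digits[2:]}"   # prefix, single space, 9 digits

for sample in ["+34 886 24 68 98", "+34 980 202 157", "+34 846082423", "+34920459596"]:
    print(sample, "->", normalize_phone(sample))
```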
Here's a simple approach: use regex to get the plus sign and all the numbers into an array (one char per element), then use other list and string manipulation operations to format it the way you like.
import re
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
pattern = r'[+\d]'
m1 = re.findall(pattern, p1)
m2 = re.findall(pattern, p2)
m1_str = f"{''.join(m1[:3])} {''.join(m1[3:])}"
m2_str = f"{''.join(m2[:3])} {''.join(m2[3:])}"
print(m1_str) # +34 886246898
print(m2_str) # +34 980202157
Or removing spaces using string replacement instead of regex:
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
p1_compact = p1.replace(' ', '')
p2_compact = p2.replace(' ', '')
p1_str = f"{p1_compact[:3]} {p1_compact[3:]}"
p2_str = f"{p2_compact[:3]} {p2_compact[3:]}"
print(p1_str) # +34 886246898
print(p2_str) # +34 980202157
I would capture the numbers like this: r"(\+34(?:\s?\d){9})".
That allows you to search for numbers with whitespace optionally placed before any of the digits. The non-capturing group (?:...) lets \s?\d repeat without each digit being listed as a group of its own.
import re
nums = """
Number 1: +34 886 24 68 98
Number 2: +34 980 202 157
Number 3: +34812345678
"""
number_re = re.compile(r"(\+34(?:\s?\d){9})")
for match in number_re.findall(nums):
    print(match)
+34 886 24 68 98
+34 980 202 157
+34812345678
I am trying to write a regular expression which returns the part of a string that comes after a given substring. For example: I want to get the text, including its spaces, that follows "15/08/2017".
a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
Is there a way to get 'AFFIDAVIT OF' and 'CASH & MTGE' as separate strings?
Here is the expression I have pieced together so far:
doc = (a.split('15/08/2017', 1)[1]).strip()
'AFFIDAVIT OF CASH & MTGE'
Not a regex based solution. But does the trick.
a='''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
doc = (a.split('15/08/2017', 1)[1]).strip()
# split on two spaces instead of one to get the desired result
# (this relies on the columns in `a` being separated by at least two spaces)
print(doc.split("  ")[0].strip()) # outputs AFFIDAVIT OF
print(doc.split("  ")[-1].strip()) # outputs CASH & MTGE
Hope it helps.
A re-based code snippet:
import re
foo = '''S
LINC SHORT LEGAL TITLE NUMBER
0037 471 661 1720278;16;21 172 211 342
LEGAL DESCRIPTION
PLAN 1720278
BLOCK 16
LOT 21
EXCEPTING THEREOUT ALL MINES AND MINERALS
ESTATE: FEE SIMPLE
ATS REFERENCE: 4;24;54;2;SW
MUNICIPALITY: CITY OF EDMONTON
REFERENCE NUMBER: 172 023 641 +71
----------------------------------------------------------------------------
----
REGISTERED OWNER(S)
REGISTRATION DATE(DMY) DOCUMENT TYPE VALUE CONSIDERATION
---------------------------------------------------------------------------
--
---
172 211 342 15/08/2017 AFFIDAVIT OF CASH & MTGE'''
pattern = r'.*\d{2}/\d{2}/\d{4}\s+(\w+\s+\w+)\s+(\w+\s+.*\s+\w+)'
result = re.findall(pattern, foo, re.MULTILINE)
print("1st match:", result[0][0])
print("2nd match:", result[0][1])
Output
1st match: AFFIDAVIT OF
2nd match: CASH & MTGE
We can try using re.findall with the following pattern:
PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)
Searching in multiline and DOTALL mode, the above pattern will match everything occurring between PHASED OF until, but not including, CONDOMINIUM PLAN.
input = "182 246 612 01/10/2018 PHASED OF CASH & MTGE\n CONDOMINIUM PLAN"
result = re.findall(r'PHASED OF (((?!\bCONDOMINIUM PLAN).)*)(?=CONDOMINIUM PLAN)', input, re.DOTALL|re.MULTILINE)
output = result[0][0].strip()
print(output)
CASH & MTGE
Note that I also strip off whitespace from the match. We might be able to modify the regex pattern to do this, but in a general solution, maybe you want to keep some of the whitespace, in certain cases.
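A possible way to move the stripping into the pattern itself, as a sketch only: put the whitespace matchers outside the capture group and make the captured part lazy. This assumes the value never needs its own leading or trailing whitespace; the variable names are illustrative.

```python
import re

text = "182 246 612 01/10/2018 PHASED OF CASH & MTGE\n CONDOMINIUM PLAN"

# \s+ and \s* sit outside the capture group, so the surrounding whitespace
# (including the newline) is consumed but not captured. The lazy (.*?)
# stops at the first point where "CONDOMINIUM PLAN" can follow.
m = re.search(r'PHASED OF\s+(.*?)\s*CONDOMINIUM PLAN', text, re.DOTALL)
value = m.group(1) if m else None
print(value)
```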
Why regular expressions?
It looks like you know the exact delimiting string, just str.split() by it and get the first part:
In [1]: a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
In [2]: a.split("15/08/2017", 1)[0]
Out[2]: '172 211 342 '
I would avoid matching the term directly with a regex here, because the only meaningful separation between the logical terms appears to be 2 or more spaces. Individual terms, including the one you want to match, may also contain single spaces. So, I recommend doing a regex split on the input using \s{2,} as the pattern. This will yield a list containing all the terms. Then we can walk down the list once, and when we find the term that follows the one we want (here the date), return the previous term in the list.
import re
a = "172 211 342  15/08/2017  TRANSFER OF LAND  $610,000  CASH & MTGE"
parts = re.compile(r"\s{2,}").split(a)
print(parts)
for i in range(1, len(parts)):
    if parts[i] == "15/08/2017":
        print(parts[i-1])
['172 211 342', '15/08/2017', 'TRANSFER OF LAND', '$610,000', 'CASH & MTGE']
172 211 342
Use a positive lookbehind assertion:
m=re.search('(?<=15/08/2017).*', a)
m.group(0)
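To get the two columns after the date as separate strings, the lookbehind can be combined with a split on runs of two or more spaces. This is a sketch under the assumption that the columns are separated by at least two spaces while the values themselves contain only single spaces; the sample line below is a shortened stand-in for the last line of the report, with the column spacing written out explicitly.

```python
import re

# Stand-in for the last line of the report (column spacing is an assumption).
line = "172 211 342   15/08/2017   AFFIDAVIT OF      CASH & MTGE"

# Everything after the date, without the leading whitespace.
m = re.search(r'(?<=15/08/2017)\s+(.*)', line)
rest = m.group(1).strip()

# Columns are separated by 2+ spaces; single spaces stay inside a value.
doc_type, consideration = re.split(r'\s{2,}', rest)
print(doc_type)
print(consideration)
```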
You have to return the right group:
re.match("(.*?)15/08/2017",a).group(1)
You need to use group(1):
import re
re.match("(.*?)15/08/2017",a).group(1)
Output
'172 211 342 '
Building on your expression, this is what I believe you need:
import re
a='172 211 342 15/08/2017 TRANSFER OF LAND $610,000 CASH & MTGE'
re.match(r"(.*?)(\w+/)",a).group(1)
Output:
'172 211 342 '
You can do this by using group(1)
re.match("(.*?)15/08/2017",a).group(1)
UPDATE
For the updated (multi-line) string, use .search instead of .match, because .match only matches at the start of the string:
re.search(r"(.*?)15/08/2017",a).group(1)
Your problem is that your string is formatted the way it is.
The line you are looking for is
182 246 612 01/10/2018 PHASED OF CASH & MTGE
You are then looking for whatever comes after 'PHASED OF' and some spaces.
You want to search for
(?<=PHASED OF)\s*(?P<your_text>.*?)\n
in your string. This will return a match object containing the value you are looking for in the named group your_text.
m = re.search(r'(?<=PHASED OF)\s*(?P<your_text>.*?)\n', a)
your_desired_text = m.group('your_text')
Also: There are many good online regex testers to fiddle around with your regexes.
And only after finishing up the regex just copy and paste it into python.
I use this one: https://regex101.com/
I have a text file which contains some strings that I want to remove from my data frame. The data frame rows contain texts that are present in the text file.
here is the text file - https://drive.google.com/open?id=1GApPKvA82tx4CDtlOTqe99zKXS3AHiuD
here is the link; Data = https://drive.google.com/open?id=1HJbWTUMfiBV54EEtgSXTcsQLzQT1rFgz
I am using the following code -
import nltk
from nltk.tokenize import word_tokenize
file = open("D://Users/Shivam/Desktop/rahulB/fliter.txt")
result = file.read()
words = word_tokenize(result)
I loaded the text file and converted it into words/tokens.
This is my dataframe:
text
0 What Fresh Hell Is This? January 31, 2018 ...A...
1 What Fresh Hell Is This? February 27, 2018 My ...
2 What Fresh Hell Is This? March 31, 2018 Trump ...
3 What Fresh Hell Is This? April 29, 2018 Michel...
4 Join Email List Contribute Join AMERICAblog Ac...
As you can see, texts such as "What Fresh Hell Is This?", "Join Email List Contribute Join AMERICAblog Ac" or "Sign in Daily Roundup MS Legislature Elected O" are present in all rows.
I used this for loop:
for word in words:
    df['text'].replace(word, ' ')
This is the error I get:
error Traceback (most recent call last)
<ipython-input-168-6e0b8109b76a> in <module>()
----> 1 df['text'] = df['text'].str.replace("|".join(words), " ")
D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in replace(self, pat, repl, n, case, flags)
1577 def replace(self, pat, repl, n=-1, case=None, flags=0):
1578 result = str_replace(self._data, pat, repl, n=n, case=case,
-> 1579 flags=flags)
1580 return self._wrap_result(result)
1581
D:\Users\Shivam\Anaconda2\lib\site-packages\pandas\core\strings.pyc in str_replace(arr, pat, repl, n, case, flags)
422 if use_re:
423 n = n if n >= 0 else 0
--> 424 regex = re.compile(pat, flags=flags)
425 f = lambda x: regex.sub(repl=repl, string=x, count=n)
426 else:
D:\Users\Shivam\Anaconda2\lib\re.pyc in compile(pattern, flags)
192 def compile(pattern, flags=0):
193 "Compile a regular expression pattern, returning a pattern object."
--> 194 return _compile(pattern, flags)
195
196 def purge():
D:\Users\Shivam\Anaconda2\lib\re.pyc in _compile(*key)
249 p = sre_compile.compile(pattern, flags)
250 except error, v:
--> 251 raise error, v # invalid expression
252 if not bypass_cache:
253 if len(_cache) >= _MAXCACHE:
error: nothing to repeat
You can use str.replace
Ex:
df['text'] = df['text'].str.replace("|".join(words), " ")
You can modify your code in this way:
for word in words:
    df['text'] = df['text'].str.replace(word, ' ')
You may use
df['text'] = df['text'].str.replace(r"\s*(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])), " ")
The r"\s*(?<!\w)(?:{})(?!\w)".format("|".join([re.escape(x) for x in words])) expression performs these steps:
[re.escape(x) for x in words] - escapes all special characters in the words so they can be used safely in a regex
"|".join([...]) - creates alternations that will be matched by the regex engine
r"\s*(?<!\w)(?:{})(?!\w)".format(...) - creates a regex like \s*(?<!\w)(?:word1|word2|wordn)(?!\w) that matches the words from the list as whole words (\s* also removes 0+ whitespace characters before the words).
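The steps above can be sketched with the re module alone, which also shows where the "nothing to repeat" error in the question comes from: a token such as "?" is a regex quantifier, so joining the raw tokens with "|" yields an invalid pattern. The token list and the sample text below are made up for illustration. With pandas, the same pattern string would be passed to df['text'].str.replace; as far as I know, recent pandas versions need regex=True for that.

```python
import re

# Hypothetical tokens, as word_tokenize might return them; "?" is the troublemaker.
words = ["Email", "List", "?"]

# Joining the raw tokens puts a bare "?" into the pattern, which is a
# quantifier with nothing in front of it.
try:
    re.compile("|".join(words))
    error_message = None
except re.error as exc:
    error_message = str(exc)
print("unescaped pattern:", error_message)

# Escaping each token first makes the alternation safe to compile.
pattern = re.compile(
    r"\s*(?<!\w)(?:{})(?!\w)".format("|".join(re.escape(w) for w in words))
)
replaced = pattern.sub(" ", "Join Email List Contribute")

# Collapse the runs of spaces left behind by the replacements.
cleaned = " ".join(replaced.split())
print(cleaned)
```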
What should be the appropriate regular expression to capture all the phone numbers listed below? I tried with one and it partially does the work. However, I would like to get them all. Thanks for any suggestion or help.
Here are the numbers along with my script I tried with:
import re
content='''
415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292
+1 416 555 9292
'''
for phone in re.findall(r'\+?1?\s?\(?\d*\)?[\s-]\d*[\s-]\d*',content):
    print(phone)
The result I'm getting is:
415
-555-1234
650-555-2345
555-3456
202
555 4567
4035555678
1 416 555
9292
+1 416 555 9292
I suggest making some parts of the regex obligatory (such as the digit patterns, by replacing * with +), or it might match meaningless parts of the text. Also, note that \s matches any whitespace, including newlines, while you most probably want each match to stay on a single line.
You might try
\+?1? ?(?:\(?\d+\)?)?(?:[ -]?\d+){1,2}
See the regex demo
Details
\+? - an optional plus
1? - an optional 1
(space)? - an optional space
(?:\(?\d+\)?)? - an optional sequence of a (, then 1+ digits and then an optional )
(?:[ -]?\d+){1,2} - 1 or 2 occurrences of:
[ -]? - an optional space or -
\d+ - 1+ digits
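As a quick check, the suggested pattern can be run against the numbers from the question. This is only a sketch for these seven samples; it does not validate that a match is a real phone number, since the pattern accepts any short run of digits.

```python
import re

content = '''
415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292
+1 416 555 9292
'''

# The pattern uses a literal space rather than \s, so a match cannot
# run across a line break.
phone_re = re.compile(r'\+?1? ?(?:\(?\d+\)?)?(?:[ -]?\d+){1,2}')

phones = phone_re.findall(content)
for phone in phones:
    print(phone)
```

Because the pattern contains no capturing groups, findall returns the full matches as plain strings.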
I think this regex will work in your case:
import re
content = '''
415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292
+1 416 555 9292
'''
for phone in re.findall(r'(([+]?\d\s\d?)?\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})', content):
    print(phone[0])