Extracting strings with labels in python using regex - python

I want to extract strings with labels from text data in Python. I have written the following code, but it replaces the actual data with the label; I want to extract the data together with its label instead.
import re
def replace_entities(example):
    # dd mm yyyy
    example = re.sub("(\d{1,31}(:? |\-|\/)\d{1,12}(:? |\-|\/)\d{4})", "DATESTR", example) # dd/mm/yyyy
    example = re.sub("(\d{4}(:? |\-|\/)\d{1,31}(:? |\-|\/)\d{1,12})", "DATESTR", example) # yyyy/dd/mm
    # email id
    example = re.sub("[\w\.-]+@[\w\.-]+", "EMAILIDSTR", example)
    # URL
    example = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', "URLSTR",
                     example)
    example = re.sub('www.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', "URLSTR", example)
    # TIME
    example = re.sub("\d{2}:\d{2} (:?AM|PM|am|pm)", "TIMESTR", example)
    example = re.sub("\d{2}:\d{2}:\d{3} (:?AM|PM|am|pm)", "TIMESTR", example)
    # MONEY
    example = re.sub(r'\£ \d+', "MONEYSTR", example, 0)
    example = re.sub(r'\£\d+', "MONEYSTR", example, 0)
    example = re.sub(r'\d+(:?\£|pound|pounds|euros|euro)', "MONEYSTR", example, 0)
    example = re.sub(r'\d+ (:?\£|pound|pounds|euros|euro)', "MONEYSTR", example, 0)
    example = re.sub(r'\d.\d+(:?\£|pound|pounds|euros|euro)', "MONEYSTR", example, 0)
    example = re.sub(r'\d.\d+ (:?\£|pound|pounds|euros|euro)', "MONEYSTR", example, 0)
    example = re.sub(r'\\xc2\\xa\d+', "MONEYSTR", example, 0)
    example = re.sub(r'\\xc2\\xa\d+.\d+', "MONEYSTR", example, 0)
    # Split alpha numeric and sp. symbol
    example = " ".join(re.findall(r"[^,.:;\(\)\/\\_]+|[,.:;\(\)\/\\_]", example))
    example = " ".join(re.findall(r"[^\d_]+|\d+", example))
    example = re.sub('(?!^)([A-Z][a-z]+)', r' \1', example)
    # NUMBERS
    example = re.sub(r'\d+', 'NUMSTR', example)
    return example
I have the following text as input:
My name is ali, Date is 21/08/2018 Total amount is euros 10 . Account number is 123456
The expected output is:
21/08/2018 : DATESTR
euros 10 : MONEYSTR
123456 : NUMSTR
How can I get the above output? Any ideas?

You can fix it by adding .*? before and .* after your pattern and replacing with r'\1 : DATESTR':
res = re.sub(r'.*?(\d{1,31}(?::? |[-/])\d{1,12}(?::? |[-/])\d{4}).*', r'\1 : DATESTR', s)
See the regex demo. The .*? matches zero or more characters other than line break characters, as few as possible, and the .* matches zero or more such characters, as many as possible. Everything outside the capture group is therefore matched and discarded, and only the captured date is kept. Because the trailing .* consumes the rest of the line, this keeps only the first date on each line.
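To see why the prefix has to be the lazy .*? rather than a greedy .*, here is a small sketch that runs both variants on the sample sentence. The variable names (date, lazy, greedy) are only illustrative; the date pattern is the one from this answer.

```python
import re

s = "My name is ALi Date is 09/03/2018"
date = r"(\d{1,31}(?::? |[-/])\d{1,12}(?::? |[-/])\d{4})"

# Lazy prefix: .*? consumes as little as possible, so the capture
# group starts at the first position where a date can match.
lazy = re.sub(r".*?" + date + r".*", r"\1 : DATESTR", s)

# Greedy prefix: .* consumes as much as it can and backtracks only as
# far as needed, so it swallows the leading "0" of the day.
greedy = re.sub(r".*" + date + r".*", r"\1 : DATESTR", s)

print(lazy)    # 09/03/2018 : DATESTR
print(greedy)  # 9/03/2018 : DATESTR
```

The greedy version still produces something date-like, which makes the bug easy to miss.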
You may also use your regex to extract the date and then append : DATESTR to it:
import re
rx = r"\d{1,31}(?::? |[-/])\d{1,12}(?::? |[-/])\d{4}"
s = "My name is ALi Date is 09/03/2018"
m = re.search(rx, s)
if m:
    print("{} : DATESTR".format(m.group())) # => 09/03/2018 : DATESTR
See the Python demo.
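The question asks for dates, money amounts and plain numbers at once. One possible way to extend the search-based approach, offered as a sketch rather than as part of the answer above, is to combine one named group per label and read the label from m.lastgroup. The pattern table, the simplified patterns and the name extract_entities are assumptions made for this illustration; they only cover the formats in the sample text.

```python
import re

# Hypothetical label -> pattern table. Order matters: when several
# alternatives could match at the same position, the earlier one wins.
LABELED_PATTERNS = [
    ("DATESTR", r"\d{1,2}[ /-]\d{1,2}[ /-]\d{4}"),
    ("MONEYSTR", r"(?:£|euros?|pounds?) ?\d+(?:\.\d+)?"
                 r"|\d+(?:\.\d+)? ?(?:£|euros?|pounds?)"),
    ("NUMSTR", r"\d+"),
]

# One named group per label; the inner patterns use only non-capturing
# groups, so m.lastgroup is always the label of the alternative that matched.
COMBINED = re.compile(
    "|".join(f"(?P<{label}>{pattern})" for label, pattern in LABELED_PATTERNS)
)

def extract_entities(text):
    """Return (matched_text, label) pairs in order of appearance."""
    return [(m.group(), m.lastgroup) for m in COMBINED.finditer(text)]

text = ("My name is ali, Date is 21/08/2018 Total amount is euros 10 . "
        "Account number is 123456")
for value, label in extract_entities(text):
    print(f"{value} : {label}")
```

Putting NUMSTR last keeps the bare-number pattern from claiming digits that belong to a date or an amount.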

You can give datefinder a try.
Here is your example done with it:
>>> import datefinder
>>> text = 'My name is ALi Date is 09/03/2018'
>>> matches = datefinder.find_dates(text)
>>> for i in matches:
...     print(i.strftime("%m/%d/%Y") + ':DATESTR')
...
09/03/2018:DATESTR
I guess this will help you. It can get any date string out of your string.

From your example, you want to do two things:
Find a date-like string
Add another string at the end of your match
The solution I propose here might not be the best but it does the thing. I propose you get the match that your regex can find, and then use that match to format whatever you want to print.
import re
string1 = "My name is ALi Date is 09/03/2018"
string2 = "DATESTR"
m = re.search(r"(\d{1,31}(:? |\-|\/)\d{1,12}(:? |\-|\/)\d{4})", string1) # match the date : dd/mm/yyyy
print(m.group(0) + ' : ' + string2)
The output is:
09/03/2018 : DATESTR
There might be some other functions that fit your needs in the documentation. That's what I just used.
https://docs.python.org/3/library/re.html

Related

Split a string into Name and Time

I want to split a string that contains a combined name and time, as shown in the example below:
Complete string
cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478
Desired output
cDOT_storage01_esx_infra02 07-19-2021
What I tried (it does not give the desired output):
j['name'].split("-")[0], j['name'].split("-")[1][0:10]
Use rsplit. The only two _ you care about are the last two, so you can limit the number of splits rsplit will attempt using _ as the delimiter.
>>> "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478".rsplit("_", 2)
['cDOT_storage01_esx_infra02', '07-19-2021', '04.45.00.0478']
You can index the resulting list as necessary to get your final result.
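As a small illustration of that last step, the three parts can also be unpacked directly instead of indexed. The variable names here are only illustrative.

```python
full_name = "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478"

# maxsplit=2 splits only at the last two underscores, so the
# underscores inside the name part are left alone.
name, date, time_part = full_name.rsplit("_", 2)

print(name, date)  # cDOT_storage01_esx_infra02 07-19-2021
```

Unpacking raises a ValueError if a string has fewer than two underscores, which can be a useful early signal that an input does not follow the expected pattern.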
If all the strings follow the same pattern (separated by an underscore(_)), you can try this.
(Untested)
string = "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478"
splitted = string.split('_')  # split() already returns a list of str
# splitted[-1] will be "04.45.00.0478"
# splitted[-2] will be "07-19-2021"
# Rest of the list will contain the front part
other = splitted.pop()
date = splitted.pop()
name = '_'.join(splitted)
print(name, date)
You can use a regex to search for the date and print both parts.
import re
txt = "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478"
# searching the date in the string
x = re.search(r"\d{2}-\d{2}-\d{4}", txt)
if x:
    print("Matched")
    a = re.split("[0-9]{2}-[0-9]{2}-[0-9]{4}", txt)
    y = re.compile(r"\d{2}-\d{2}-\d{4}")
    print(a[0][:-1] , " ", y.findall(txt)[0])
else:
    print("No match")
Output:
Matched
cDOT_storage01_esx_infra02 07-19-2021
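The same result can be had from a single search with two capture groups, which avoids splitting and searching separately. This is a sketch that assumes the date is always MM-DD-YYYY and is surrounded by underscores, as in the sample string.

```python
import re

txt = "cDOT_storage01_esx_infra02_07-19-2021_04.45.00.0478"

# Group 1: everything before the underscore that precedes the date.
# Group 2: the date itself, which must be followed by another underscore.
m = re.search(r"^(.*)_(\d{2}-\d{2}-\d{4})_", txt)
if m:
    name, date = m.groups()
    print(name, date)  # cDOT_storage01_esx_infra02 07-19-2021
else:
    print("No match")
```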

Replace string which has dynamic character in python

I am trying to replace part of a string using a regular expression, without success.
The strings are "LIVE_CUS2_PHLR182", "LIVE_CUS2ee_PHLR182" and "PHLR182 - testing recovery". I need to get PHLR182 as the output for all of them, but in the second string the "ee" is not constant: it can be any two characters, letters or digits. Below is the code I have tried.
For the first and last strings I simply used the replace function, like below.
s = "LIVE_CUS2_PHLR182"
s.replace("LIVE_CUS2_", ""), s.replace(" - testing recovery","")
>>> PHLR182
But for the second string I tried the following attempts:
1. s= "LIVE_CUS2ee_PHLR182"
s.replace(r'LIVE_CUS2(\w+)*_','')
2. batRegex = re.compile(r'LIVE_CUS2(\w+)*_PHLR182')
mo2 = batRegex.search('LIVE_CUS2dd_PHLR182')
mo2.group()
3. re.sub(r'LIVE_CUS2(?is)/s+_PHLR182', '', r)
In all cases I could not get "PHLR182" as the output. Please help me.
I think this is what you need:
import re
texts = """LIVE_CUS2_PHLR182
LIVE_CUS2ee_PHLR182
PHLR182 - testing recovery""".split('\n')
pat = re.compile(r'(LIVE_CUS2\w{,2}_| - testing recovery)')
# 1st alternative: 'LIVE_CUS2', then up to two word characters (\w), then '_'
# 2nd alternative: ' - testing recovery'
results = [pat.sub('', text) for text in texts]
# replace the matched pattern with empty string
print(f'Original: {texts}')
print(f'Results: {results}')
Result:
Original: ['LIVE_CUS2_PHLR182', 'LIVE_CUS2ee_PHLR182', 'PHLR182 - testing recovery']
Results: ['PHLR182', 'PHLR182', 'PHLR182']
Python Demo: https://repl.it/repls/ViolentThirdAutomaticvectorization
Regex Demo: https://regex101.com/r/JiEVqn/2
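As a side note on why the first attempt in the question leaves the string untouched: str.replace treats its argument as literal text, not as a pattern. A minimal sketch of the difference follows; the variable names are illustrative, and {0,2} is just the explicit spelling of the {,2} used above.

```python
import re

s = "LIVE_CUS2ee_PHLR182"

# str.replace looks for the literal characters "LIVE_CUS2(\w+)*_",
# which do not occur in s, so nothing is replaced.
unchanged = s.replace(r"LIVE_CUS2(\w+)*_", "")

# re.sub interprets the first argument as a regular expression.
stripped = re.sub(r"LIVE_CUS2\w{0,2}_", "", s)

print(unchanged)  # LIVE_CUS2ee_PHLR182
print(stripped)   # PHLR182
```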

How to start at a specific letter and end when it hits a digit?

I have some sample strings:
s = 'neg(able-23, never-21) s2-1/3'
i = 'amod(Market-8, magical-5) s1'
I can figure out whether the string ends with 's1' or 's3' using:
word = re.search(r's\d$', s)
But if I want to know whether the string contains 's2-1/3', it won't work.
Is there a regular expression that works for both cases, 's#' and 's#+'?
Thanks!
You can allow the characters "-" and "/" to be captured as well, in addition to just digits. It's hard to tell the exact pattern you're going for here, but something like this would capture "s2-1/3" from your example:
import re
s = "neg(able-23, never-21) s2-1/3"
word = re.search(r"s\d[-/\d]*$", s)
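To check that this pattern handles both sample strings from the question, here is a short sketch; the list names are only illustrative.

```python
import re

pattern = re.compile(r"s\d[-/\d]*$")
samples = [
    "neg(able-23, never-21) s2-1/3",
    "amod(Market-8, magical-5) s1",
]

found = []
for text in samples:
    m = pattern.search(text)
    # m is None when the string does not end with an s-tag
    found.append(m.group() if m else None)

print(found)  # ['s2-1/3', 's1']
```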
I'm guessing that maybe you would want to extract that with some expression, such as:
(s\d+)-?(.*)$
Demo 1
or:
(s\d+)-?([0-9]+)?\/?([0-9]+)?$
Demo 2
Test
import re
expression = r"(s\d+)-?(.*)$"
string = """
neg(able-23, never-21) s211-12/31
neg(able-23, never-21) s2-1/3
amod(Market-8, magical-5) s1
"""
print(re.findall(expression, string, re.M))
Output
[('s211', '12/31'), ('s2', '1/3'), ('s1', '')]

Regex match everything between special tag

I have the following string that I need to parse and get the values of anything inside the defined \$ tags
for example, the string
The following math equation: \$f(x) = x^2\$ is the same as \$g(x) = x^(4/2) \$
I want to parse whatever is in between the \$ tags, so that the result will contain both equations
'f(x) = x^2'
'g(x) = x^(4/2) '
I tried something like re.compile(r'\\\$(.)*\\$') but it didn't work.
You almost got it, just missing a backslash and a question mark (so it stops as soon as it finds the second \$ and doesn't match the longest string possible): r'\\\$(.*?)\\\$'
>>> pattern = r'\\\$(.*?)\\\$'
>>> data = "The following math equation: \$f(x) = x^2\$ is the same as \$g(x) = x^(4/2) \$"
>>> re.findall(pattern, data)
['f(x) = x^2', 'g(x) = x^(4/2) ']
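To make the role of the question mark concrete, this sketch runs the lazy and the greedy version of the pattern on the same input. A raw string is used for the data so the backslashes are kept literally; the names lazy and greedy are illustrative.

```python
import re

data = r"The following math equation: \$f(x) = x^2\$ is the same as \$g(x) = x^(4/2) \$"

# Lazy: each match stops at the nearest closing tag.
lazy = re.findall(r"\\\$(.*?)\\\$", data)

# Greedy: one match runs from the first opening tag to the last closing tag.
greedy = re.findall(r"\\\$(.*)\\\$", data)

print(lazy)         # ['f(x) = x^2', 'g(x) = x^(4/2) ']
print(len(greedy))  # 1
```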
That regex can fit:
/\\\$.{0,}\\\$/g
/ - begin
\\\$ - escaped: \$
. - any character between
{0,} - at least 0 chars (any number of chars, actually)
\\\$ - escaped: \$
/ - end
g - global search
This works:
import re
regex = r'\\\$(.*)\\\$'
r = re.compile(regex)
print(r.match(r"\$f(x) = x^2\$").group(1))
print(r.match(r"\$g(x) = x^(4/2) \$").group(1))

How to substitute character in string using regex

I want to change the following string
^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
to this:
^mylog\.2015-12-04\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
(20151204 changed to 2015-12-04 only)
I can accomplish it by:
re.sub("20151204", "2015-12-04", string)
where
string= ^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
But the value 20151204 is a date and will change and I can't have it hardcoded.
I tried:
re.sub("2015\\d{2}\\d{2}", "2015\-\\d{2}\-\\d{2}", string)
However this did not work.
You need to use capture groups in the pattern and backreferences in the replacement:
result = re.sub("2015(\\d{2})(\\d{2})", "2015-\\1-\\2", string)
# => ^mylog\.2015-12-04\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
See IDEONE demo
If you need to match any year after ^mylog\., you can use
result = re.sub(r"^\^mylog\\\.(\d{4})(\d{2})(\d{2})", r"^mylog\.\1-\2-\3", string)
See another demo
You first need to find the date and then convert it into the required format and then replace the new string in your old text.
See the code below:
import re
text = r"^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$"
search = re.search(r'\d{4}\d{2}\d{2}', text)
search = search.group()
You get search as:
20151204
Now create the date as you want:
new_text = search[0:4] + "-" + search[4:6] + "-" + search[6:8]
So new_text will be:
2015-12-04
Now substitute new_text for the earlier string using re.sub():
text = re.sub(search,new_text,text)
So now text will be:
^mylog\.2015-12-04\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
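The find, reformat and replace steps can also be folded into one re.sub call by passing a function as the replacement. This is a sketch under the assumption that the date is the only run of exactly eight digits in the string; the helper name hyphenate is made up for the example.

```python
import re

def hyphenate(match):
    """Turn a YYYYMMDD match into YYYY-MM-DD."""
    year, month, day = match.groups()
    return f"{year}-{month}-{day}"

text = r"^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$"

# The lookarounds make sure exactly eight digits are matched, so the
# digits inside \d{2} are never touched. count=1 stops after the date.
result = re.sub(r"(?<!\d)(\d{4})(\d{2})(\d{2})(?!\d)", hyphenate, text, count=1)

print(result)  # ^mylog\.2015-12-04\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
```

A function replacement returns its string as-is, so there is no need to worry about backslashes in the replacement being interpreted as escapes.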
