How to substitute character in string using regex

I want to change the following string
^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
to this:
^mylog\.2015-12-04\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
(20151204 changed to 2015-12-04 only)
I can accomplish it by:
re.sub("20151204", "2015-12-04", string)
where
string= ^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
But the value 20151204 is a date and will change and I can't have it hardcoded.
I tried:
re.sub("2015\\d{2}\\d{2}", "2015\-\\d{2}\-\\d{2}", string)
However this did not work.

You need to use capture groups in the pattern and backreferences in the replacement:
result = re.sub("2015(\\d{2})(\\d{2})", "2015-\\1-\\2", string)
# => ^mylog\.2015-12-04\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
See IDEONE demo
If you need to match any year after ^mylog\., you can use
result = re.sub(r"^\^mylog\\\.(\d{4})(\d{2})(\d{2})", r"^mylog\.\1-\2-\3", string)
See another demo
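If neither the year nor the prefix is fixed, a more general variant can key on the eight-digit run itself. The following is a minimal sketch, assuming the date is the only run of exactly eight digits in the string; the helper name add_date_dashes is made up for illustration.

```python
import re

def add_date_dashes(s):
    # Hypothetical helper: rewrite a standalone YYYYMMDD run as YYYY-MM-DD.
    # The lookarounds keep the match from starting or ending inside a longer digit run.
    return re.sub(r"(?<!\d)(\d{4})(\d{2})(\d{2})(?!\d)", r"\1-\2-\3", s)

string = r"^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$"
result = add_date_dashes(string)
print(result)
```

Because the pattern does not mention mylog at all, it would also rewrite an eight-digit run anywhere else in the string, so it only fits inputs where that cannot happen.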

You first need to find the date, convert it into the required format, and then replace the old date with the new string in your text.
See the code below:
import re
text = r"^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$"
search = re.search(r'\d{4}\d{2}\d{2}', text)
search = search.group()
This gives search as:
20151204
Now create the date as you want:
new_text = search[0:4] + "-" + search[4:6] + "-" + search[6:8]
So new_text will be:
2015-12-04
Now substitute new_text for the earlier date string using re.sub():
text = re.sub(search,new_text,text)
So now text will be:
^mylog\.2015-12-04\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$
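The find, reformat and replace steps above can also be folded into a single re.sub() call by passing a function as the replacement. This is a minimal sketch, assuming the date is an eight-digit YYYYMMDD run; reformat_date is an illustrative name, and datetime.strptime is used only so that an impossible date raises ValueError instead of being rewritten silently.

```python
import re
from datetime import datetime

def reformat_date(match):
    # Parse the matched digits as a date; raises ValueError for e.g. 20151345.
    parsed = datetime.strptime(match.group(), "%Y%m%d")
    return parsed.strftime("%Y-%m-%d")

text = r"^mylog\.20151204\-\d{2}\:\d{2}\:\d{2}\.gc\.log\.gz$"
# re.sub calls reformat_date once per match and inserts its return value.
new_text = re.sub(r"\d{8}", reformat_date, text)
print(new_text)
```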

Related

Regex to ignore Semicolon

I have one column in a dataframe with key value pairs I would like to extract.
'AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619'
I would like to parse key value pairs like so
('AF_ESP', '0.00546')
('AF_EXAC', '0.00165')
('AF_TGP', '0.00619')
Here is my regex.
([^=]+)=([^;]+)
This gets me most of way there:
('AF_ESP', '0.00546')
(';AF_EXAC', '0.00165')
(';AF_TGP', '0.00619')
How can I adjust it so semicolons are not captured in the result?

You can consume the semi-colon or start of string in front:
(?:;|^)([^=]+)=([^;]+)
See the regex demo. Details:
(?:;|^) - a non-capturing group matching ; or start of string
([^=]+) - Group 1: one or more chars other than =
= - a = char
([^;]+) - Group 2: one or more chars other than ;.
See the Python demo:
import re
text = "AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619"
print( re.findall(r'(?:;|^)([^=]+)=([^;]+)', text) )
# => [('AF_ESP', '0.00546'), ('AF_EXAC', '0.00165'), ('AF_TGP', '0.00619')]
A non-regex solution is also possible:
text = "AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619"
print( [x.split('=') for x in text.split(';')] )
# => [['AF_ESP', '0.00546'], ['AF_EXAC', '0.00165'], ['AF_TGP', '0.00619']]
See this Python demo.

This can be also solved with a split method:
text = "AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619"
print([tuple(i.split('=')) for i in text.split(';')])
output:
[('AF_ESP', '0.00546'), ('AF_EXAC', '0.00165'), ('AF_TGP', '0.00619')]

An alternate and somewhat simpler approach to @Wiktor's solution is, in steps:
Capture everything until the =.
Get the = but don't capture that.
Get everything after the = up until an optional ; if that exists.
This would translate to the following regex:
([^=]+)=([^;]+);?
And in python:
>>> re.findall(r'([^=]+)=([^;]+);?', "AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619")
[('AF_ESP', '0.00546'), ('AF_EXAC', '0.00165'), ('AF_TGP', '0.00619')]
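Another way to keep the semicolon out of the key, sketched here as an assumption rather than taken from the answers above, is to exclude ; from the first character class so that a key can never start on the delimiter. The dict conversion at the end is an optional extra.

```python
import re

text = "AF_ESP=0.00546;AF_EXAC=0.00165;AF_TGP=0.00619"
# [^=;]+ : key characters may be anything except '=' and ';'
pairs = re.findall(r"([^=;]+)=([^;]+)", text)
print(pairs)

# Optional: turn the pairs into a dict of floats (assumes every value is numeric).
values = {key: float(value) for key, value in pairs}
print(values)
```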

How to remove text before a particular character or string in multi-line text?

I want to remove all the text before and including */ in a string.
For example, consider:
string = ''' something
other things
etc. */ extra text.
'''
Here I want extra text. as the output.
I tried:
string = re.sub("^(.*)(?=*/)", "", string)
I also tried:
string = re.sub(re.compile(r"^.\*/", re.DOTALL), "", string)
But when I print string, it did not perform the operation I wanted and the whole string is printing.

I suppose you're fine without regular expressions:
string[string.index("*/ ")+3:]
And if you want to strip that newline:
string[string.index("*/ ")+3:].rstrip()
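Note that str.index raises ValueError when the marker is missing. Below is a minimal sketch of a variant using str.partition, which avoids the hard-coded +3 offset and, as an assumption about the desired behaviour, returns the text unchanged when */ is absent. The name text_after is illustrative.

```python
def text_after(s, marker="*/"):
    # partition splits at the first occurrence of marker
    before, found, after = s.partition(marker)
    # If the marker is missing, `found` is empty and s is returned unchanged.
    return after.strip() if found else s

string = ''' something
other things
etc. */ extra text.
'''
print(text_after(string))
```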

The problem with your first regex is that . does not match newlines as you noticed. With your second one, you were closer but forgot the * that time. This would work:
string = re.sub(re.compile(r"^.*\*/", re.DOTALL), "", string)
You can also just get the part of the string that comes after your "*/":
string = re.search(r"(\*/)(.*)", string, re.DOTALL).group(2)
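With re.DOTALL, a greedy .* runs to the last */ in the string. If the text might contain several markers and only the first should end the cut, a lazy quantifier does that. A minimal sketch follows; the two-marker sample string is made up for illustration.

```python
import re

sample = "a\nb */ keep this */ and this"

# Greedy: removes everything up to and including the LAST "*/".
greedy = re.sub(r"^.*\*/", "", sample, flags=re.DOTALL)
# Lazy: removes everything up to and including the FIRST "*/".
lazy = re.sub(r"^.*?\*/", "", sample, flags=re.DOTALL)

print(repr(greedy))
print(repr(lazy))
```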

Update: After doing some research, I found that the pattern (\n|.) to match everything including newlines is inefficient. I've updated the answer to use [\s\S] instead as shown on the answer I linked.
The problem is that . in python regex matches everything except newlines. For a regex solution, you can do the following:
import re
strng = ''' something
other things
etc. */ extra text.
'''
print(re.sub(r"[\s\S]+\*/", "", strng))
# extra text.
Add in a .strip() if you want to remove that remaining leading whitespace.

To keep the text up to that symbol instead, you can do:
split_str = string.split(' ')
boundary = split_str.index('*/')
new = ' '.join(split_str[0:boundary])
print(new)
which gives you:
something
other things
etc.

string_list = string.split('*/')[1:]
string = '*/'.join(string_list)
print(string)
gives output as
' extra text. \n'

Python RegexHelp

I have a sentence and want to run a regex on it to match a word.
Test Inputs :
This is about CHG6784532
Starting CHG4560986.
Code Snippet:
regVal = re.compile(r"(CHG\w+)")
for i in text:
    if regVal.search(i):
        print(i)
Desired Output:
CHG4560986 ( NOT CHG4560986.)
The output for the first input is right, it prints "CHG6784532", but the second prints "CHG4560986.". I tried adding ^ and $ to the regex, but that still does not help. Is there something I am missing here?
Thanks!

Make sure text is a string variable (if it is a list use " ".join(text) instead of text in the code below) and then you may use
import re
text="This is about CHG6784532\nStarting CHG4560986."
regVal = re.compile(r"CHG\w+")
res = regVal.findall(text)
print(res)
# => ['CHG6784532', 'CHG4560986']
See the Python demo.
Details
regVal = re.compile(r"CHG\w+") - the regVal variable is declared that holds the CHG\w+ pattern: it matches CHG and then 1+ word chars
res = regVal.findall(text) finds all the matching substrings in text variable and saves them in res variable
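If text is a list of lines, as the question's loop suggests, the stray period comes from printing the whole line i rather than the matched part. Here is a minimal sketch that keeps the loop and prints only the match; the list named lines stands in for the question's text.

```python
import re

lines = ["This is about CHG6784532", "Starting CHG4560986."]
regVal = re.compile(r"CHG\w+")

found = []
for line in lines:
    m = regVal.search(line)
    if m:
        # m.group() is only the matched text, so the trailing "." is left out
        found.append(m.group())
        print(m.group())
```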

Get all occurrences of a regex in a string in Python

I am trying to find in the following string TreeModel/Node/Node[1]/Node[4]/Node[1] this :
TreeModel/Node
TreeModel/Node/Node[1]
TreeModel/Node/Node[1]/Node[4]
TreeModel/Node/Node[1]/Node[4]/Node[1]
Using regular expression in python. Here is the code I tried:
string = 'TreeModel/Node/Node[1]/Node[4]/Node[1]'
pattern = r'.+?Node\[[1-9]\]'
print(re.findall(pattern=pattern, string=string))
#result : ['TreeModel/Node/Node[1]', '/Node[4]', '/Node[1]']
#expected result : ['TreeModel/Node', 'TreeModel/Node/Node[1]', 'TreeModel/Node/Node[1]/Node[4]', 'TreeModel/Node/Node[1]/Node[4]/Node[1]']

You can use split here:
>>> s = 'TreeModel/Node/Node[1]/Node[4]/Node[1]'
>>> split_s = s.split('/')
>>> ['/'.join(split_s[:i]) for i in range(2, len(split_s)+1)]
['TreeModel/Node',
'TreeModel/Node/Node[1]',
'TreeModel/Node/Node[1]/Node[4]',
'TreeModel/Node/Node[1]/Node[4]/Node[1]']
You can also use regex:
import re
for i in range(2, s.count('/')+2):
    s_ = '[^/]+/*'
    regex = re.search(r'('+s_*i+')', s).group(0)
    print(regex)
This prints (note the trailing slashes):
TreeModel/Node/
TreeModel/Node/Node[1]/
TreeModel/Node/Node[1]/Node[4]/
TreeModel/Node/Node[1]/Node[4]/Node[1]
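The regex loop above leaves a trailing slash on all but the last result. One possible alternative, a sketch that assumes segments are separated by single slashes, is to find each slash (and the end of the string) and slice the string up to that position.

```python
import re

s = 'TreeModel/Node/Node[1]/Node[4]/Node[1]'
# "/|$" matches every slash and, finally, the empty position at the end.
cut_points = [m.start() for m in re.finditer(r"/|$", s)]
# Skip the first cut point, which would give just 'TreeModel'.
prefixes = [s[:pos] for pos in cut_points[1:]]
for p in prefixes:
    print(p)
```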

I'm not good at Python at all, but for the regex part, given the specific structure of your string, the regex below matches each segment:
/?(?:{[^{}]*})?[^/]+
The braces and the preceding / are optional. It matches a slash (if any), then braces with their content (if any), then the rest up to the next slash.
Python code (see live demo here):
matches = re.findall(r'/?(?:{[^{}]*})?[^/]+', string)
output = ''
for i in range(len(matches)):
    output += matches[i]
    print(output)
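The running concatenation in the loop can also be expressed with itertools.accumulate, which for strings joins each match onto the previous result. This is a minimal sketch using the question's string; since that string has no braces, a simplified pattern /?[^/]+ is assumed here.

```python
import re
from itertools import accumulate

string = 'TreeModel/Node/Node[1]/Node[4]/Node[1]'
matches = re.findall(r'/?[^/]+', string)
# accumulate() yields 'TreeModel', 'TreeModel/Node', ... by repeated concatenation
paths = list(accumulate(matches))
# Drop the first item to match the question's expected result.
print(paths[1:])
```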

Extracting strings with labels in python using regex

I want to extract strings with labels from text data in Python. I have written the following code, but it replaces the actual data with the label; I want to extract the data instead.
import re
def replace_entities(example):
    # dd mm yyyy
    example = re.sub("(\d{1,31}(:? |\-|\/)\d{1,12}(:? |\-|\/)\d{4})", "DATESTR", example)  # dd/mm/yyyy
    example = re.sub("(\d{4}(:? |\-|\/)\d{1,31}(:? |\-|\/)\d{1,12})", "DATESTR", example)  # yyyy/dd/mm
    # email id
    example = re.sub("[\w\.-]+@[\w\.-]+", "EMAILIDSTR", example)
    # URL
    example = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', "URLSTR",
                     example)
    example = re.sub('www.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', "URLSTR", example)
    # TIME
    example = re.sub("\d{2}:\d{2} (:?AM|PM|am|pm)", "TIMESTR", example)
    example = re.sub("\d{2}:\d{2}:\d{3} (:?AM|PM|am|pm)", "TIMESTR", example)
    # MONEY
    example = re.sub(r'\£ \d+', "MONEYSTR", example, 0)
    example = re.sub(r'\£\d+', "MONEYSTR", example, 0)
    example = re.sub(r'\d+(:?\£|pound|pounds|euros|euro)', "MONEYSTR", example, 0)
    example = re.sub(r'\d+ (:?\£|pound|pounds|euros|euro)', "MONEYSTR", example, 0)
    example = re.sub(r'\d.\d+(:?\£|pound|pounds|euros|euro)', "MONEYSTR", example, 0)
    example = re.sub(r'\d.\d+ (:?\£|pound|pounds|euros|euro)', "MONEYSTR", example, 0)
    example = re.sub(r'\\xc2\\xa\d+', "MONEYSTR", example, 0)
    example = re.sub(r'\\xc2\\xa\d+.\d+', "MONEYSTR", example, 0)
    # Split alpha numeric and sp. symbol
    example = " ".join(re.findall(r"[^,.:;\(\)\/\\_]+|[,.:;\(\)\/\\_]", example))
    example = " ".join(re.findall(r"[^\d_]+|\d+", example))
    example = re.sub('(?!^)([A-Z][a-z]+)', r' \1', example)
    # NUMBERS
    example = re.sub(r'\d+', 'NUMSTR', example)
    return example
I have the following text as input:
My name is ali, Date is 21/08/2018 Total amount is euros 10 . Account number is 123456
The expected output is:
21/08/2018 : DATESTR
euros 10 : MONEYSTR
123456 : NUMSTR
How can I get the above output?
Any ideas?

You may fix it by adding .*? before and .* after the pattern you have, and replacing with r'\1 : DATESTR':
res = re.sub(r'.*?(\d{1,31}(?::? |[-/])\d{1,12}(?::? |[-/])\d{4}).*', r'\1 : DATESTR', s)
See the regex demo. With .*? you match any 0+ chars other than line break chars, as few as possible, and with .* you match any 0+ chars other than line break chars, as many as possible, and that way you remove what you do not need by just matching and you keep what you capture.
You may also use your regex to extract the date and then append : DATESTR to it:
import re
rx = r"\d{1,31}(?::? |[-/])\d{1,12}(?::? |[-/])\d{4}"
s = "My name is ALi Date is 09/03/2018"
m = re.search(rx, s)
if m:
print("{} : DATESTR".format(m.group())) # => 09/03/2018 : DATESTR
See the Python demo.
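To extract several kinds of labelled values in one pass, as the question's expected output suggests, the patterns can be combined into one alternation with named groups. This is only a sketch: the simplified DATESTR, MONEYSTR and NUMSTR patterns below are assumptions written for the sample sentence, not the question's full patterns, and extract_entities is an illustrative name.

```python
import re

LABELLED = re.compile(
    r"(?P<DATESTR>\d{1,2}[-/ ]\d{1,2}[-/ ]\d{4})"           # dd/mm/yyyy
    r"|(?P<MONEYSTR>(?:euros?|pounds?|£)\s?\d+(?:\.\d+)?)"  # currency before amount
    r"|(?P<NUMSTR>\d+)"                                     # any other number
)

def extract_entities(text):
    # m.lastgroup is the name of the alternative that matched.
    return [(m.group(), m.lastgroup) for m in LABELLED.finditer(text)]

s = "My name is ali, Date is 21/08/2018 Total amount is euros 10 . Account number is 123456"
for value, label in extract_entities(s):
    print("{} : {}".format(value, label))
```

The order of the alternatives matters: the date and money patterns come before the plain number pattern so that their digits are not claimed as NUMSTR first.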

You can give datefinder a try.
Here is your example done with it:
>>> import datefinder
>>> str = 'My name is ALi Date is 09/03/2018'
>>> matches = datefinder.find_dates(str)
>>> for i in matches:
... print(i.strftime("%m/%d/%Y") + ':DATESTR')
...
09/03/2018:DATESTR
I guess this will help you. It can get any date string out of your string.

From your example, you want to do two things:
Find a date-like string
Add another string at the end of your match
The solution I propose here might not be the best but it does the thing. I propose you get the match that your regex can find, and then use that match to format whatever you want to print.
import re
string1 = "My name is ALi Date is 09/03/2018"
string2 = "DATESTR"
m = re.search("(\d{1,31}(:? |\-|\/)\d{1,12}(:? |\-|\/)\d{4})", string1 ) # match the date : dd/mm/yyyy
print( m.group(0) + ' : ' + string2 )
The output is:
09/03/2018 : DATESTR
There might be some other functions that fit your needs in the documentation. That's what I just used.
https://docs.python.org/3/library/re.html
