I am trying to clean up some logs and want to extract general information from the messages. I am new to Python and only learned regular expressions yesterday, and now I have a problem.
My messages look like this:
Report ZSIM_RANDOM_DURATION_ started
Report ZSIM_SYSTEM_ACTIVITY started
Report /BDL/TASK_SCHEDULER started
Report ZSIM_JOB_CREATE started
Report RSBTCRTE started
Report SAPMSSY started
Report RSRZLLG_ACTUAL started
Report RSRZLLG started
Report RGWMON_SEND_NILIST started
I tried this code:
clean_special2=re.sub(r'^[Report] [^1-9] [started]','',text)
but I think this code will remove all rows. I want to keep the format "Report ... started", so I only want to remove the job name in the middle.
I expect my outcome to look like this:
Report started
Can anyone help me with an idea? Thank you very much!
Try something like this:
clean_special2=re.sub(r'(?<=^Report\b).*(?=\bstarted)',' ',text)
Explanation: the (?<=...) is a positive lookbehind, i.e. the text before the match must match the content of this group, but it is not part of the match and thus not replaced. The same applies on the other side with a positive lookahead (?=...). The \b is a word boundary, so everything between these two words is matched. Since this also removes the surrounding whitespace, the replacement is a single space.
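A minimal sketch of the lookaround substitution applied to a multi-line log, assuming the whole log sits in one string (the sample text below is made up from the messages in the question). With more than one line, ^ only anchors at each line start when re.MULTILINE is passed, so the flag is added here.

```python
import re

# Sample log assembled from the question's messages (assumption: one string, one message per line).
text = (
    "Report ZSIM_RANDOM_DURATION_ started\n"
    "Report /BDL/TASK_SCHEDULER started\n"
    "Report RSBTCRTE started"
)

# Replace everything between "Report" and "started" with a single space.
# re.MULTILINE lets ^ match at the start of every line, not just the first.
clean_special2 = re.sub(
    r"(?<=^Report\b).*(?=\bstarted)", " ", text, flags=re.MULTILINE
)
print(clean_special2)
```

Without the flag, only the first line of such a string would be rewritten.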
I don't know the Python syntax well, but I am sure this regexp can help you match your string:
/^Report\W+([\w&.##%^!~-]+)\W+started/m
The Python code might look like this:
text = "Report ZSIM_RANDOM_DURATION_ started"
clean_special2 = re.sub(r'^Report\W+([\w&.##%^!~-]+)\W+started', 'Report started', text)
This should do... '^Report\ [^\ ]*\ started'
Regex is black magic, only use it when you have to. Online tools make it much easier to write: https://regex101.com/
Related
I'm really sorry for asking, because there are already some questions like this around, but I can't adapt their answers to my problem.
These are the input lines (e.g. from a config file):
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values after the = sign with a regex in Python 3.9.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=. I want exactly the opposite.
The expected results (what Python's re.findall() returns) are
['share2', 'share8', 'shareSSH', 'share9']
Try the pattern profile\d+\.name=(.*):
import re
re.findall(r'profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex, split should work absolutely fine:
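A minimal sketch of the split approach, assuming the config lines are already available as a list of strings (the variable names are illustrative, not from the original answer):

```python
# Input lines taken from the question.
lines = [
    "profile2.name=share2",
    "profile8.name=share8",
    "profile4.name=shareSSH",
    "profile9.name=share9",
]

# Split each line once at the first "=" and keep the right-hand side.
values = [line.split("=", 1)[1] for line in lines if "=" in line]
print(values)
```

Passing 1 as maxsplit keeps any further = signs inside the value intact.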
Try removing the ? quantifier. It makes your capture group match an empty string.
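To illustrate the point about the lazy quantifier, here is a small sketch comparing both patterns on two of the question's lines (txt is a made-up sample string). A lazy group at the very end of a pattern is satisfied by matching nothing, so it captures an empty string.

```python
import re

# Two sample lines from the question, joined into one string (assumption).
txt = "profile2.name=share2\nprofile8.name=share8"

# Lazy group at the end of the pattern: stops as early as possible, i.e. at zero characters.
lazy = re.findall(r"^profile[0-9]\.name=(.*?)", txt, flags=re.MULTILINE)

# Greedy group: runs to the end of each line.
greedy = re.findall(r"^profile[0-9]\.name=(.*)", txt, flags=re.MULTILINE)

print(lazy)
print(greedy)
```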
I am new to Python, and I need to remove the space between a word and a number only, not between two words.
E.g.:
Input : Paragraph 25 is in documents and paragraph number in another file.
Output : Paragraph25 is in documents and paragraph number in another file.
How can this be done in Python? I tried this regex:
re.sub("paragraph\s[a-z]", "paragraph[a-z]", Input)
But it's not working.
>>> re.sub(r'\s+(\d+)', r'\1', 'Program 25 is fun')
'Program25 is fun'
That might work in a pinch. I'm not the most familiar with regexes, so hopefully someone who is can chime in with something more robust.
Basically, we match whitespace followed by digits and remove the whitespace.
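A slightly stricter variant, offered as a sketch rather than a definitive answer: it only removes whitespace that sits between a letter and a digit, so a space between two numbers is left alone. The function name and the ASCII-only letter class are assumptions for illustration.

```python
import re

def join_word_and_number(text):
    # Remove whitespace only when a letter precedes it and a digit follows it.
    return re.sub(r"(?<=[A-Za-z])\s+(?=\d)", "", text)

sentence = "Paragraph 25 is in documents and paragraph number in another file."
print(join_word_and_number(sentence))
```

The lookbehind and lookahead check the neighbours without consuming them, so no backreference is needed in the replacement.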
I found many programs online to replace text in a string or file with words prescribed in a dictionary. For example, https://www.daniweb.com/programming/software-development/code/216636/multiple-word-replace-in-text-python
But I was wondering how to get the program to ignore certain parts of the text. For instance, I would like it to ignore parts that are enclosed within, say, % signs (%Please ignore this%). Better still, how do I get it to ignore the text within but remove the % signs at the end of the run?
Thank you.
This could very easily be done with regular expressions, although they may not be supported by the online programs you find. You will probably need to write something yourself and then use regexes as your dict's search keys.
Good place to start playing around with regex is: http://regexr.com
Well, in the replacement dictionary, add an extra entry for any word you want ignored: for example, have teh replaced with the, but %teh% replaced with teh. For the program in the link you could have
wordDic = {
'booster': 'rooster',
'%booster%': 'booster'
}
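A more general sketch, not the linked program's code: a single regex that matches either a %...% span or one of the dictionary words, with a replacement function that strips the % signs from protected spans and looks up everything else. The names word_dic, _swap and replace_words are hypothetical, and the sketch assumes % never appears inside protected text.

```python
import re

# Hypothetical replacement dictionary for illustration.
word_dic = {"booster": "rooster", "teh": "the"}

# Alternative 1: anything between two % signs (protected text).
# Alternative 2: any dictionary key as a whole word.
pattern = re.compile(
    r"%([^%]*)%|\b(" + "|".join(map(re.escape, word_dic)) + r")\b"
)

def _swap(match):
    if match.group(1) is not None:
        # Protected span: keep the text unchanged, drop the % signs.
        return match.group(1)
    return word_dic[match.group(2)]

def replace_words(text):
    return pattern.sub(_swap, text)

print(replace_words("teh booster and %teh booster%"))
```

Because the protected alternative comes first and consumes the whole %...% span, words inside it are never seen by the dictionary alternative.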
I am trying to parse a large log in Python using a complex regular expression.
The sequence I am looking for spans multiple consecutive lines and can have a positive or negative outcome. Here is the regex with test data:
https://regex101.com/r/pF3eM6/1
The regex does what it should, pulling the start and end date, source and target format, job ID, and completion time or "Time out". However, when the sequence is incomplete (for example at the end of the file), the expression ends in an error (try deleting the last line). I would like the expression to simply ignore such a sequence (no match).
And for completeness, here is the same regex with the "Time Out" variant of the log file, which has the same problem as above.
https://regex101.com/r/pF3eM6/2
Can someone help me understand what is happening and why?
I had to rewrite it completely. I read the articles at http://regular-expressions.info and I finally understand how regex works :)
For posterity, here is the commented solution:
https://regex101.com/r/pF3eM6/8
I want to replace consecutive repeated symbols with just one, such as:
this is a dog???
to
this is a dog?
I'm using
str = re.sub(r"([^\s\w])(\s*\1)+", r"\1", str)
however I notice that this might also replace symbols in URLs that appear in my text,
like http://example.com/this--is-a-page.html
Can someone give me some advice on how to alter my regex?
So you want to unleash the power of regular expressions on an irregular language like HTML. First of all, search SO for "parse HTML with regex" to find out why that might not be such a good idea.
Then consider the following: You want to replace duplicate symbols in (probably user-entered) text. You don't want to replace them inside a URL. How can you tell what a URL is? They don't always start with http – let's say ars.userfriendly.org might be a URL that is followed by a longer path that contains duplicate symbols.
Furthermore, you'll find lots of duplicate symbols that you definitely don't want to replace (think of nested parentheses (like this)), some of them maybe inside a <script> on the page you're working on (||, && etc. come to mind).
So you might come up with something like
(?<!\b(?:ftp|http|mailto)\S+)([^\\|&/=()"'\w\s])(?:\s*\1)+
which happens to work on the source code of this very page but will surely fail in other cases (for example if URLs don't start with ftp, http or mailto). Plus, it won't work in Python since it uses variable repetition inside lookbehind.
All in all, you probably won't get around parsing your HTML with a real parser, locating the body text, applying a regex to it and writing it back.
EDIT:
OK, you're already working on the parsed text, but it still might contain URLs.
Then try the following:
result = re.sub(
r"""(?ix) # case-insensitive, verbose regex
# Either match a URL
# (protocol optional (if so, URL needs to start with www or ftp))
(?P<URL>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$])
# or
|
# match repeated non-word characters
(?P<rpt>[^\s\w])(?:\s{0,100}(?P=rpt))+""",
# and replace with both captured groups (one will always be empty)
r"\g<URL>\g<rpt>", subject)
Re-EDIT: Hm, Python chokes on the (?:\s*(?P=rpt))+ part, saying the + has nothing to repeat. Looks like a bug in Python (reproducible with (.)(\s*\1)+ whereas (.)(\s?\1)+ works)...
Re-Re-EDIT: If I replace the * with {0,100}, then the regex compiles. But now Python complains about an unmatched group. Obviously you can't reference a group in a replacement if it hasn't participated in the match. I give up... :(
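One way around the unmatched-group problem, sketched here under assumptions: pass a function as the replacement instead of a template string, so that only the group that actually matched is read. The URL pattern below is a deliberately simplified stand-in for the one above, and the names URL_OR_REPEAT and collapse_repeats are made up for this example. It was written with a current Python 3 in mind, where the \s* form is expected to compile.

```python
import re

URL_OR_REPEAT = re.compile(
    r"""
    (?P<url>\b(?:(?:https?|ftp|file)://|www\.|ftp\.)\S*)   # a URL: matched so it can be kept as-is
    |
    (?P<rpt>[^\s\w])(?:\s*(?P=rpt))+                        # a symbol repeated one or more times
    """,
    re.IGNORECASE | re.VERBOSE,
)

def collapse_repeats(text):
    # The callable returns the URL unchanged, or a single copy of the repeated symbol.
    return URL_OR_REPEAT.sub(lambda m: m.group("url") or m.group("rpt"), text)

print(collapse_repeats("this is a dog??? see http://example.com/this--is-a-page.html !!"))
```

Because the URL alternative is tried first, repeated symbols inside a recognised URL are consumed as part of it and left alone. Anything the simplified URL pattern fails to recognise is still subject to collapsing.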