Trouble with complex multi-line regular expression

I am trying to parse a large log in Python using a complex regular expression.
The sequence I am looking for spans multiple consecutive lines and can have a positive or a negative outcome. Here is the regex with test data:
https://regex101.com/r/pF3eM6/1
The regex does what it should, pulling the start and end dates, the source and target formats, the job ID, and the completion time or "Time out". However, when the sequence is incomplete (for example at the end of the file), the expression ends in an error (try deleting the last line). I would like the expression to simply ignore such a sequence (no match).
For completeness, here is the same regex with the "Time Out" variant of the log file, which shows the same problem:
https://regex101.com/r/pF3eM6/2
Can someone help me understand what is happening and why?

I had to rewrite it completely. I read the articles at http://regular-expressions.info and I finally understand how regex works :)
For posterity, here is the commented solution:
https://regex101.com/r/pF3eM6/8
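The linked solution is not reproduced here, so the following is only a rough sketch of the general idea: if every part of the multi-line sequence is mandatory in the pattern, an incomplete sequence at the end of the file simply produces no match. The log format, the sample lines and all group names below are hypothetical, since the real log is only visible behind the regex101 links.

```python
import re

# Hypothetical log format; the real one is only shown in the regex101 links.
log = """\
2016-01-01 10:00:00 Start job 17 pdf -> png
2016-01-01 10:00:05 Completed in 5.0 s
2016-01-01 10:01:00 Start job 18 pdf -> jpg
2016-01-01 10:02:00 Time out
2016-01-01 10:03:00 Start job 19 pdf -> gif
"""

# re.X allows whitespace and comments in the pattern; [ ] is a literal space.
job_re = re.compile(
    r"""
    ^(?P<start>[\d-]+[ ][\d:]+)[ ]Start[ ]job[ ](?P<job>\d+)[ ]
    (?P<src>\w+)[ ]->[ ](?P<dst>\w+)\n                     # line 1: job header
    (?P<end>[\d-]+[ ][\d:]+)[ ]
    (?P<outcome>Completed[ ]in[ ][\d.]+[ ]s|Time[ ]out)$   # line 2 is required
    """,
    re.M | re.X,
)

# Job 19 has no second line, so it is skipped instead of causing trouble.
results = [(m.group("job"), m.group("outcome")) for m in job_re.finditer(log)]
print(results)
```

Because the second line is not optional, the dangling "Start job 19" entry is ignored rather than half-matched.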

Related

Modify the data found between two recurring patterns in a multi-line string

I have a multi-line string of around 10000-40000 characters (it changes with the data returned by an API). In this string, there are a number of tables (they are part of the string, but formatted in a way that makes them look like tables). The tables always follow a repeating pattern, which looks like this:
==============================================================================
*THE HEADINGS/COLUMN NAMES IN THE TABLE*
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
I'm trying to display the contents in HTML on a locally hosted webpage, and I want the headings of the tables displayed in a specific way (think color, font size). For that, I'm using Python's re module to identify the pattern, but I'm failing to do so due to inexperience with the re module. To modify the part that I need modified, I'm using the following piece of code:
re.sub(r'\={78}.*\-{78}',some_replacement_string, complete_multi_line_string)
But this code does not give me the output I require, since it does not match the pattern properly (I'm sure the mistake is in the pattern I'm asking re.sub to match).
However:
re.sub(r'\-{78}',some_replacement_string, complete_multi_line_string)
does work, as it returns the string with the replacement, but the slight problem here is that the string contains multiple lines of 78 hyphens that I do not want modified. Please help me out here. If it is helpful, the output I want is something like:
==============================================================================
<span>*THE HEADINGS/COLUMN NAMES IN THE TABLE*</span>
------------------------------------------------------------------------------
THE DATA IN THE TABLE, FORMATTED TO BE UNDER RESPECTIVE COLUMNS
Also, please note that there is a newline (\n) after each line of 78 = characters, after each heading line, and after each line of 78 - characters, if that is helpful in getting to the solution.
The code snippet I'm currently trying to debug, if helpful:
result = re.sub(r'\={78}.*\-{78}', replacement, multi_line_string)
l = result.count('</span>')
print(l)
PS: There are 78 = and 78 - characters in every occurrence.

You should try the following version:
re.sub(r'(={78})\n(.*?)\n(-{78})', r'\1\n<span>\2</span>\n\3', complete_multi_line_string, flags=re.S)
The changes I made here include:
Match on the lazy dot .*? instead of the greedy dot .*, so that a match cannot run across header sections
Match with the re.S flag, so that .*? can also match across newlines
Capture the three parts in groups and put them, together with the newlines between them, back in the replacement
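To see the lazy dot and re.S working together, here is a small self-contained sketch. The two-table sample string and the names RULE_EQ, RULE_DASH, sample and header_re are made up for illustration; only the 78-character rules come from the question. The replacement also writes back the newlines that the pattern consumes.

```python
import re

# Hypothetical sample: two tables in the layout described in the question.
RULE_EQ = "=" * 78
RULE_DASH = "-" * 78
sample = (
    f"{RULE_EQ}\nNAME   SIZE\n{RULE_DASH}\nfoo    10\n"
    f"{RULE_EQ}\nHOST   PORT\n{RULE_DASH}\nbar    80\n"
)

# Lazy .*? stops at the first dashed rule; re.S lets it span multi-line headings.
header_re = re.compile(r"(={78})\n(.*?)\n(-{78})", flags=re.S)

# The pattern consumes the two newlines, so the replacement puts them back.
result = header_re.sub(r"\1\n<span>\2</span>\n\3", sample)

print(result.count("</span>"))
```

With a greedy .* the first match would run from the first = rule to the last - rule, and only one heading would be wrapped.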

How to remove strings between two characters using regular expression python

I am trying to clean up some logs and want to extract general information from the messages. I am new to Python and only learned about regular expressions yesterday, and now I have a problem.
My messages look like this:
Report ZSIM_RANDOM_DURATION_ started
Report ZSIM_SYSTEM_ACTIVITY started
Report /BDL/TASK_SCHEDULER started
Report ZSIM_JOB_CREATE started
Report RSBTCRTE started
Report SAPMSSY started
Report RSRZLLG_ACTUAL started
Report RSRZLLG started
Report RGWMON_SEND_NILIST started
I tried this code:
clean_special2=re.sub(r'^[Report] [^1-9] [started]','',text)
but I think this code will remove entire rows, whereas I want to keep the format Report ... started. So I only want to remove the job name in the middle.
I expect my outcome to look like this:
Report started
Can anyone help me with an idea? Thank you very much!

Try something like this:
clean_special2 = re.sub(r'(?<=^Report\b).*(?=\bstarted)', ' ', text, flags=re.M)
Explanation: (?<=...) is a positive lookbehind, i.e. the string must match the content of this group, but that content is not part of the match and thus not replaced. The same happens on the other side with the positive lookahead (?=...). The \b is a word boundary, so everything between these two words is matched. Since this also consumes the surrounding whitespace, the replacement is a single space. The re.M flag makes ^ match at the start of every line, which is needed when text contains several lines.
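As a quick check of the lookaround approach, the sketch below runs it over three of the lines from the question. The re.M flag is used here so that ^ matches at the start of each line; the variable names text and cleaned are just illustrative.

```python
import re

text = """Report ZSIM_RANDOM_DURATION_ started
Report /BDL/TASK_SCHEDULER started
Report RSBTCRTE started"""

# re.M makes ^ match at the start of every line, not only at the start of the string.
# Only the text between "Report" and "started" is matched and replaced by one space.
cleaned = re.sub(r"(?<=^Report\b).*(?=\bstarted)", " ", text, flags=re.M)

print(cleaned)
```

Since the lookarounds do not consume "Report" or "started", both words survive the substitution untouched, including for job names that contain slashes.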

I don't know the Python syntax well, but I am sure this regexp can help you match your string:
/^Report\W+([\w&.##%^!~-]+)\W+started/m
The Python code might look like this:
text = "Report ZSIM_RANDOM_DURATION_ started"
clean_special2 = re.sub(r'^Report\W+([\w&.##%^!~-]+)\W+started', 'Report started', text, flags=re.M)

This should do... '^Report\ [^\ ]*\ started'
Regex is black magic, only use it when you have to. Online tools make it much easier to write: https://regex101.com/

Python - Injecting html tags into strings based on regex match

I wrote a Python script for a custom HTML page that finds a word within a string/line and highlights just that word using the following tags, where instance is the word being searched for.
<b><font color=\"red\">"+instance+"</font></b>
I need to find a word (case-insensitively), let's say "port", within a string, where it can appear as port, Port, SUPPORT, Support, support etc., which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However, my strings often contain two or more instances in a single line, and I need to wrap each of those instances in
<b><font color=\"red\">"+instance+"</font></b>, without changing their case.
The problem with my approach is that I iterate over each of the instances found with findall (exact match),
while the same match can occur multiple times within the string.
for instance in find_all_instances:
    second_pattern = re.compile(instance)
    string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in the following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I thought I could avoid this if I were able to find out the exact part of the string that pattern.sub is substituting at the moment it does so,
however, I could not find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way to insert <b><font color="red">instance</font></b> around every match (case-insensitively) without replacing the matched text itself, I would be grateful.

Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs

Okay, here are two ways I put together quickly! The second loop is definitely the way to go. It uses re.sub (as someone else commented too). Bear in mind that it replaces each match with the lowercase search term.
import re
word = "port"
replacement = "<b><font color=\"red\">" + word + "</font></b>"
with open("testing.txt", "r") as FILE:
    # THIS LOOP IS CASE SENSITIVE
    for line in FILE:
        newline = line.replace(word, replacement)
        print(newline)
    FILE.seek(0)  # rewind, otherwise the second loop reads nothing
    # THIS LOOP IS CASE INSENSITIVE
    pattern = re.compile(word, re.IGNORECASE)
    for line in FILE:
        newline = pattern.sub(replacement, line)
        print(newline)
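If the original case of each match has to be preserved, as the question asks, one possible variation is to refer to the matched text itself in the replacement via \g<0>. This is a minimal sketch, not the approach from the answers above; the helper name highlight is hypothetical.

```python
import re

def highlight(line, word):
    """Wrap every case-insensitive occurrence of word, keeping its original case."""
    # re.escape guards against regex metacharacters in the search term.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    # \g<0> refers to the whole match, so "Port" stays "Port".
    return pattern.sub(r'<b><font color="red">\g<0></font></b>', line)

print(highlight("Port 80 and SUPPORT for port 8080", "port"))
```

A single re.sub call visits each occurrence exactly once, which avoids the nested tags produced by looping over the findall results.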

Replacing strings in a text and ignoring certain parts

I found many programs online to replace text in a string or file with words prescribed in a dictionary. For example, https://www.daniweb.com/programming/software-development/code/216636/multiple-word-replace-in-text-python
But I was wondering how to get the program to ignore certain parts of the text. For instance, I would like it to ignore parts that are enclosed in, say, % signs (%Please ignore this%). Better still, how do I get it to ignore the enclosed text but remove the % signs by the end of the run?
Thank you.

This could very easily be done with regular expressions, although they may not be supported by any online programs you find. You will probably need to write something yourself and then use regexes as your dict's search keys.
Good place to start playing around with regex is: http://regexr.com

Well, in the replacement dictionary, just add an entry for any word you want ignored. For example, teh is replaced with the, but %teh% is replaced with teh. For the program in the link you could have
wordDic = {
'booster': 'rooster',
'%booster%': 'booster'
}
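For arbitrary protected text, rather than one dictionary entry per word, a regex with two alternatives is one option: the first alternative matches a whole %...% span, the second a dictionary word. This is only a sketch; word_dic, replace_unprotected and the sample text are hypothetical and not taken from the linked program.

```python
import re

# Hypothetical replacement table, in the style of the linked snippet.
word_dic = {"booster": "rooster", "teh": "the"}

# First alternative: a %...% span (captured without the signs).
# Second alternative: any dictionary key as a whole word.
pattern = re.compile(
    r"%([^%]*)%|\b(" + "|".join(map(re.escape, word_dic)) + r")\b"
)

def replace_unprotected(match):
    if match.group(1) is not None:
        # Protected text: keep it unchanged but drop the % signs.
        return match.group(1)
    return word_dic[match.group(2)]

text = "teh booster and %teh booster%"
cleaned = pattern.sub(replace_unprotected, text)
print(cleaned)
```

Because the %...% alternative consumes the whole protected span in one match, the dictionary words inside it are never seen by the second alternative.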

Regex named conditional lookahead (in Python)

I'm hoping to match the beginning of a string differently based on whether a certain block of characters is present later in the string. A very simplified version of this is:
re.search("""^(?(pie)a|b)c.*(?P<pie>asda)$""", 'acaaasda')
Where, if <pie> is matched, I want to see a at the beginning of the string, and if it isn't then I'd rather see b.
I'd use normal numerical lookahead but there's no guarantee how many groups will or won't be matched between these two.
I'm currently getting error: unknown group name. The sinking feeling in my gut tells me that this is because what I want is impossible (look-ahead to named groups isn't exactly a feature of a regular language parser), but I really really really want this to work -- the alternative is scrapping 4 or 5 hours' worth of regex writing and redoing it all tomorrow as a recursive descent parser or something.
Thanks in advance for any help.

Unfortunately, I don't think there is a way to do what you want to do with named groups. If you don't mind duplication too much, you could duplicate the shared conditions and OR the expressions together:
^(ac.*asda|bc.*)$
If it is a complicated expression you could always use string formatting to share it (rather than copy-pasting the shared part):
common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)

You can use something like this:
^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$
or without .*$ if you don't need it.
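Both suggestions can be compared side by side with a few test strings. The sketch below uses the patterns from the two answers; the variable names alternation and lookahead and the extra candidate strings are chosen here only for illustration.

```python
import re

# Variant 1: share the common part via string formatting.
common_regex = "c.*"
alternation = re.compile("^(a{common}asda|b{common})$".format(common=common_regex))

# Variant 2: a lookahead right after "a" requires (and captures) the trailing block.
lookahead = re.compile(r"^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$")

for candidate in ("acaaasda", "bcxyz", "acxyz"):
    m = lookahead.match(candidate)
    # Print whether variant 1 matches, and what variant 2 captured as "pie".
    print(candidate, bool(alternation.match(candidate)), m.group("pie") if m else None)
```

In the lookahead variant the named group pie is still available after a match on the a branch, while on the b branch it is simply unset.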
