Check empty lines in Python - python

I have a text file that contains many lines. I want to check whether a specific line (calling xyz ...) is present between two marker lines (the ++ start line and the -- exiting line). If the line (calling xyz ...) is present, it should be returned; if it is not present, a NULL value should be returned. I want to store the results in a list.
Example file:
++ start line
22 15:36:53
dog, cat, monkey, rat
calling xxxxx
animal already added
-- exiting line
The block above should add calling xxxxx to the list.
++ start line
12 12:56:34
cat, camel, cow, dog
animal already added
-- exiting line
In the block above, calling xyz is missing, so NULL should be added to the list.
Expected Output
calling xxxxx
NULL

You can use this regex to check the condition that you have mentioned:
^\+\+(?=(?:(?!\-\-).)*\s+(calling[^\n]+)).*?\s+--
If it matches, the calling line is available as group 1.
Sample source:
import re
regex = r"(?:^\+\+(?=(?:(?!\-\-).)*\s+(calling[^\n]+)).*?\s+--)|(?:^\+\+(?=(?:(?!\-\-).)*\s+(?!calling[^\n]+)).*?\s+--)"
test_str = ("++ start line \n"
"22 15:36:53 \n"
"dog, cat, monkey, rat\n"
"calling xxxxx\n"
"animal already added\n"
"-- exiting line\n\n\n"
"++ start line \n"
"12 12:56:34 \n"
"cat, camel, cow, dog \n"
"animal already added\n"
"-- exiting line\n\n"
"++ start line \n"
"12 12:56:34 \n"
"cat, camel, cow, dog \n"
"calling pqr \n"
"animal already added\n"
"-- exiting line\n\n")
matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)
for match in matches:
    print(match.group(1))
Output:
calling xxxxx
None
calling pqr

You may want to use multiple patterns: one for separating the blocks and one for searching for calling... within a block.
Expression for the block:
^\+\+
(?P<block>[\s\S]+?)
^--.+
Expression for calling...:
^calling.+
As a Python snippet:
import re
rx_block = re.compile(r'''
^\+\+
(?P<block>[\s\S]+?)
^--.+''', re.MULTILINE | re.VERBOSE)
rx_calling = re.compile(r'''
^calling.+
''', re.MULTILINE | re.VERBOSE)
numbers = [number.group(0) if number else None
for block in rx_block.finditer(your_string_here)
for number in [rx_calling.search(block.group('block'))]]
print(numbers)
Which yields
['calling xxxxx', None]

One can use split function to get sub-parts and check them:
outlist = []
with open("calling.txt", "r") as ff:
    lines = ff.read()
records = lines.split("++ start line ")
records = list(filter(lambda x: len(x)>0, records))
for rec in records:
    found = False
    rows = rec.split("\n")
    for row in rows:
        if not found and row.startswith("calling"):
            outlist.append(row.split(" ")[1])
            found = True
    if not found:
        outlist.append("NULL")
print(outlist)
Output:
['xxxxx', 'NULL', 'pqr']
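For comparison, here is a minimal sketch of the same idea without regex or split, scanning line by line and keeping track of whether we are inside a block. The function name find_calling_lines and its parameters are made up for illustration, and it assumes that blocks are never nested and that only the first calling line per block matters.

```python
def find_calling_lines(text, start="++", end="--", wanted="calling"):
    """Return one entry per ++/-- block: the first 'calling' line, or None."""
    results = []
    in_block = False
    found = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(start):
            # A new block begins: forget anything seen before.
            in_block, found = True, None
        elif line.startswith(end) and in_block:
            # Block ends: record what was found (None if nothing).
            results.append(found)
            in_block = False
        elif in_block and found is None and line.startswith(wanted):
            found = line
    return results


sample = """++ start line
22 15:36:53
dog, cat, monkey, rat
calling xxxxx
animal already added
-- exiting line
++ start line
12 12:56:34
cat, camel, cow, dog
animal already added
-- exiting line
"""
print(find_calling_lines(sample))  # ['calling xxxxx', None]
```

If the literal string NULL is wanted instead of None, the list can be post-processed, e.g. [x or "NULL" for x in results].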

Related

Python regex pattern in order to find if a code line is finishing with a space or tab character

Sorry for putting such a low level question but I really tried to look for the answer before coming here...
Basically I have a script that searches inside .py files and reads their code line by line. The object of the script is to find whether a line ends with a space or a tab, as in the example below:
i = 5
z = 25
Basically, after the i assignment there should be a \s and after the z assignment a \t. (I hope the code formatting will not erase them.)
def custom_checks(file, rule):
    """
    @param file: file: file in-which you search for a specific character
    @param rule: the specific character you search for
    @return: dict obj with the form { line number : character }
    """
    rule=re.escape(rule)
    logging.info(f" File {os.path.abspath(file)} checked for {repr(rule)} inside it ")
    result_dict = {}
    file = fileinput.input([file])
    for idx, line in enumerate(file):
        if re.search(rule, line):
            result_dict[idx + 1] = str(rule)
    file.close()
    if not len(result_dict):
        logging.info("Zero non-compliance found based on the rule:2 consecutive empty rows")
    else:
        logging.warning(f'Found the next errors:{result_dict}')
After that, if I check the logging output, I see this:
checked for '\+s\\s\$' inside it
I don't know why the backslashes are doubled.
Also, I get all the regexes from a config.json, which is this one:
{
"ends with tab":"+\\t$",
"ends with space":"+s\\s$"
}
Could someone please help me in this direction? I know I could do this in other ways, such as reversing the line with [::-1], taking the first character and checking whether it is \s, etc., but I really want to do it with a regex.
Thanks!
Try:
rules = {
'ends with tab': re.compile(r'\t$'),
'ends with space': re.compile(r' $'),
}
Note: iterating over the file leaves the newline ('\n') at the end of each line, but $ in a regex (without re.MULTILINE) matches both at the very end of the string and just before a newline that ends the string. Thus, if using regex, you don't need to explicitly strip newlines.
if rule.search(line):
    ...
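To make the behaviour of $ concrete, here is a small self-contained sketch. The helper name violations is made up for illustration; the two patterns are the ones from the rules dict above.

```python
import re

rules = {
    "ends with tab": re.compile(r"\t$"),
    "ends with space": re.compile(r" $"),
}


def violations(line):
    """Names of all rules whose pattern matches the line."""
    return [name for name, rx in rules.items() if rx.search(line)]


# The lines still carry their trailing newline, as they do when read from a file.
print(violations("i = 5 \n"))     # ['ends with space']
print(violations("z = 25\t\n"))   # ['ends with tab']
print(violations("ok = 1\n"))     # []
```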
Personally, however, I would use line.rstrip() != line.rstrip('\n') to flag trailing spaces of any kind in one shot.
If you want to directly check for specific characters at the end of the line, you then need to strip any newline, and you need to check if the line isn't empty. For example:
char = '\t'
s = line.strip('\n')
if s and s[-1] == char:
    ...
Addendum 1: read rules from JSON config
# here from a string, but could be in a file, of course
json_config = """
{
"ends with tab": "\\t$",
"ends with space": " $"
}
"""
rules = {k: re.compile(v) for k, v in json.loads(json_config).items()}
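As a hedged illustration of how the same config would look when it lives in a file: the sketch below uses a raw Python string to stand in for the file contents (so Python does not consume any backslashes), which is an assumption made only to keep the example self-contained. In the JSON text, "\\t$" decodes to the three characters \t$, which the regex engine then reads as "tab at end of line".

```python
import json
import re

# Raw string: this is exactly what the bytes of config.json would look like.
json_text = r'''
{
    "ends with tab": "\\t$",
    "ends with space": " $"
}
'''

patterns = json.loads(json_text)
rules = {name: re.compile(pattern) for name, pattern in patterns.items()}

# After JSON decoding, the tab rule is backslash + t + dollar (3 characters).
print(len(patterns["ends with tab"]))
print(bool(rules["ends with tab"].search("z = 25\t\n")))
print(bool(rules["ends with space"].search("z = 25\t\n")))
```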
Addendum 2: comments
The following shows how to comment out a rule, as well as a rule to detect comments in the file to process. Since JSON doesn't support comments, we can consider yaml instead:
yaml_config = """
ends with space: ' $'
ends with tab: \\t$
is comment: ^\\s*#
# ignore: 'foo'
"""
import yaml
rules = {k: re.compile(v) for k, v in yaml.safe_load(yaml_config).items()}
Note: 'is comment' is easy. A hypothetical 'has comment' is much harder to define -- why? I'll leave that as an exercise for the reader ;-)
Note 2: in a file, the yaml config would be without double backslash, e.g.:
cat > config.yml << EOF
ends with space: ' $'
ends with tab: \t$
is comment: ^\s*#
# ignore: 'foo'
EOF
Additional thought
You may want to give autopep8 a try.
Example:
cat > foo.py << EOF
# this is a comment
text = """
# xyz
bar
"""
def foo():
# to be continued
pass
def bar():
pass
EOF
Note: to reveal the extra spaces:
cat foo.py | perl -pe 's/$/|/'
# this is a comment |
|
text = """|
# xyz |
bar |
"""|
def foo(): |
# to be continued |
pass |
|
def bar():|
pass |
|
|
|
There are several PEP8 issues with the above (extra spaces at end of lines, only 1 line between the functions, etc.). Autopep8 fixes them all (but correctly leaves the text variable unchanged):
autopep8 foo.py | perl -pe 's/$/|/'
# this is a comment|
|
text = """|
# xyz |
bar |
"""|
|
|
def foo():|
# to be continued|
pass|
|
|
def bar():|
pass|

Finding data in-between two strings in python

I have a text file which contain some format like :
PAGE(leave) 'Data1'
line 1
line 2
line 2
...
...
...
PAGE(enter) 'Data1'
I need to get all the lines between the two keywords and save them to a text file. This is what I have come up with so far, but I have an issue with the single quotes: the regular expression seems to treat them as the quotes delimiting the expression rather than as part of the keyword.
My codes so far:
log_file = open('messages','r')
data = log_file.read()
block = re.compile(ur'PAGE\(leave\) \'Data1\'[\S ]+\s((?:(?![^\n]+PAGE\(enter\) \'Data1\').)*)', re.IGNORECASE | re.DOTALL)
data_in_home_block=re.findall(block, data)
file = 0
make_directory("home_to_home_data",1)
for line in data_in_home_block:
    file = file + 1
    with open("home_to_home_" + str(file) , "a") as data_in_home_to_home:
        data_in_home_to_home.write(str(line))
It would be great if someone could guide me how to implement it..
As pointed out by @JoanCharmant, it is not necessary to use regex for this task, because the records are delimited by fixed strings.
Something like this should be enough:
messages = open('messages').read()
# str.split and str.rpartition take the delimiters literally,
# so the parentheses must not be backslash-escaped here
blocks = [block.rpartition("PAGE(enter) 'Data1'")[0]
          for block in messages.split("PAGE(leave) 'Data1'")
          if block and not block.isspace()]
for count, block in enumerate(blocks, 1):
    with open('home_to_home_%d' % count, 'a') as stream:
        stream.write(block)
If it's the single quotes that worry you, you can delimit the regular expression string with double quotes...
'hello "howdy"' # Correct
"hello 'howdy'" # Correct
Now, there are more issues here... Even when the pattern is declared as a raw string (r prefix), backslashes and other metacharacters that should match literally must still be escaped for the regex engine in the pattern passed to .compile (see What does the "r" in pythons re.compile(r' pattern flags') mean? ). It's just that without the r, you would probably need a lot more backslashes.
I've created a test file with two "sections":
PAGE\(leave\) 'Data1'
line 1
line 2
line 3
PAGE\(enter\) 'Data1'
PAGE\(leave\) 'Data1'
line 4
line 5
line 6
PAGE\(enter\) 'Data1'
The code below will do what you want (I think)
import re
log_file = open('test.txt', 'r')
data = log_file.read()
log_file.close()
block = re.compile(
ur"(PAGE\\\(leave\\\) 'Data1'\n)"
"(.*?)"
"(PAGE\\\(enter\\\) 'Data1')",
re.IGNORECASE | re.DOTALL | re.MULTILINE
)
data_in_home_block = [result[1] for result in re.findall(block, data)]
for data_block in data_in_home_block:
    print "Found data_block: %s" % (data_block,)
Outputs:
Found data_block: line 1
line 2
line 3
Found data_block: line 4
line 5
line 6
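If the regex route is preferred, one way to sidestep the manual escaping is to let re.escape build the literal parts of the pattern. This is only a sketch in Python 3, assuming the delimiters appear in the file exactly as in the question (no backslashes); the names START, END and block_rx are made up for illustration.

```python
import re

START = "PAGE(leave) 'Data1'"
END = "PAGE(enter) 'Data1'"

# re.escape() turns the fixed delimiters into patterns that match them
# literally; [^\n]*\n skips the rest of the start line; (.*?) lazily
# captures everything up to the next end delimiter.
block_rx = re.compile(
    re.escape(START) + r"[^\n]*\n(.*?)" + re.escape(END),
    re.DOTALL,
)

sample = (
    "PAGE(leave) 'Data1'\n"
    "line 1\nline 2\n"
    "PAGE(enter) 'Data1'\n"
    "PAGE(leave) 'Data1'\n"
    "line 3\n"
    "PAGE(enter) 'Data1'\n"
)
blocks = block_rx.findall(sample)
print(blocks)  # ['line 1\nline 2\n', 'line 3\n']
```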

Multi-line Matching in Python

I've read all of the articles I could find, even understood a few of them but as a Python newb I'm still a little lost and hoping for help :)
I'm working on a script to parse items of interest out of an application specific log file, each line begins with a time stamp which I can match and I can define two things to identify what I want to capture, some partial content and a string that will be the termination of what I want to extract.
My issue is multi-line, in most cases every log line is terminated with a newline but some entries contain SQL that may have new lines within it and therefore creates new lines in the log.
So, in a simple case I may have this:
[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)
This all appears as one line which I can match with this:
re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2}).*(milliseconds)')
However in some cases there may be line breaks in the SQL, as such I want to still capture it (and potentially replace the line breaks with spaces). I am currently reading the file a line at a time which obviously isn't going to work so...
Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?
How would I write a multi-line RegEx that would match either the whole thing on one line or if it is spread across multiple lines?
My overall goal is to parameterize this so I can use it for extracting log entries that match different patterns of the starting string (always the start of a line), the ending string (where I want to capture to) and a value that is between them as an identifier.
Thanks in advance for any help!
Chris.
import sys, getopt, os, re
sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
lines = []
print "--- START ----"
lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')
lines = []
with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line):
            if lineEndsWith.match(line) :
                print 'Full Line Found'
                print line
                print "- Record Separator -"
            else:
                print 'Partial Line Found'
                print line
                print "- Record Separator -"
print "--- DONE ----"
Next step, for my partial line I'll continue reading until I find lineEndsWith and assemble the lines in to one block.
I'm no expert so suggestions are always welcome!
UPDATE - So I have it working, thanks to all the responses that helped direct things, I realize it isn't pretty and I need to clean up my if / elif mess and make it more efficient but IT's WORKING! Thanks for all the help.
import sys, getopt, os, re
sourceFolder = 'C:/MaxLogs'
logFileName = sourceFolder + "/Test.log"
print "--- START ----"
lineStartsWith = re.compile('\[(0?[1-9]|[12][0-9]|3[01])(\/)(0?[1-9]|[12][0-9]|3[01])(\/)([0-9]{2})(\ )')
lineContains = re.compile('.*BMXAA6720W.*')
lineEndsWith = re.compile('(?:.*milliseconds.*)')
lines = []
multiLine = False
with open(logFileName, 'r') as f:
    for line in f:
        if lineStartsWith.match(line) and lineContains.match(line) and lineEndsWith.match(line):
            lines.append(line.replace("\n", " "))
        elif lineStartsWith.match(line) and lineContains.match(line) and not multiLine:
            #Found the start of a multi-line entry
            multiLineString = line
            multiLine = True
        elif multiLine and not lineEndsWith.match(line):
            multiLineString = multiLineString + line
        elif multiLine and lineEndsWith.match(line):
            multiLineString = multiLineString + line
            multiLineString = multiLineString.replace("\n", " ")
            lines.append(multiLineString)
            multiLine = False
for line in lines:
    print line
Do I need to process the whole file in one go? They are typically 20mb in size. How do I read the entire file and iterate through it looking for single or multi-line blocks?
There are two options here.
You could read the file block by block, making sure to attach any "leftover" bit at the end of each block to the start of the next one, and search each block. Of course you will have to figure out what counts as "leftover" by looking at what your data format is and what your regex can match, and in theory it's possible for multiple blocks to all count as leftover…
Or you could just mmap the file. An mmap acts like a bytes (or like a str in Python 2.x), and leaves it up to the OS to handle paging blocks in and out as necessary. Unless you're trying to deal with absolutely huge files (gigabytes in 32-bit, even more in 64-bit), this is trivial and efficient:
with open('bigfile', 'rb') as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        for match in compiled_re.finditer(m):
            do_stuff(match)
In older versions of Python, mmap isn't a context manager, so you'll need to wrap contextlib.closing around it (or just use an explicit close if you prefer).
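A runnable version of the mmap idea might look like the sketch below (Python 3). The temporary file and its contents are invented purely so the example can run on its own; the important detail is that the pattern has to be a bytes pattern, because the mmap exposes the file as bytes.

```python
import mmap
import os
import re
import tempfile

# Bytes pattern (rb"..."), since searching an mmap means searching bytes.
entry_rx = re.compile(rb"\[(.*?)\].*?milliseconds\)", re.DOTALL)

# Throwaway log file with one entry spread over two lines.
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"[8/21/13 11:30:33:557 PDT] select 1\n"
              b"from dual (execution took 5 milliseconds)\n")
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        # Groups come back as bytes; decode them if text is needed.
        stamps = [match.group(1).decode("ascii")
                  for match in entry_rx.finditer(m)]

os.remove(path)
print(stamps)  # ['8/21/13 11:30:33:557 PDT']
```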
How would I write a multi-line RegEx that would match either the whole thing on one line or if it is spread across multiple lines?
You could use the DOTALL flag, which makes the . match newlines. You could instead use the MULTILINE flag and put appropriate $ and/or ^ characters in, but that makes simple cases a lot harder, and it's rarely necessary. Here's an example with DOTALL (using a simpler regexp to make it more obvious):
>>> s1 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and (exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)"""
>>> s2 = """[8/21/13 11:30:33:557 PDT] 00000488 SystemOut O 21 Aug 2013 11:30:33:557 [WARN] [MXServerUI01] [CID-UIASYNC-17464] BMXAA6720W - USER = (ABCDEF) SPID = (2526) app (ITEM) object (ITEM) : select * from item where ((status != 'OBSOLETE' and itemsetid = 'ITEMSET1') and
(exists (select 1 from maximo.invvendor where (exists (select 1 from maximo.companies where (( contains(name,' $AAAA ') > 0 )) and (company=invvendor.manufacturer and orgid=invvendor.orgid))) and (itemnum = item.itemnum and itemsetid = item.itemsetid)))) and (itemtype in (select value from synonymdomain where domainid='ITEMTYPE' and maxvalue = 'ITEM')) order by itemnum asc (execution took 2083 milliseconds)"""
>>> r = re.compile(r'\[(.*?)\].*?milliseconds\)', re.DOTALL)
>>> r.findall(s1)
['8/21/13 11:30:33:557 PDT']
>>> r.findall(s2)
['8/21/13 11:30:33:557 PDT']
As you can see the second .*? matched the newline just as easily as a space.
If you're just trying to treat a newline as whitespace, you don't need either; '\s' already catches newlines.
For example:
>>> s1 = 'abc def\nghi\n'
>>> s2 = 'abc\ndef\nghi\n'
>>> r = re.compile(r'abc\s+def')
>>> r.findall(s1)
['abc def']
>>> r.findall(s2)
['abc\ndef']
You can read an entire file into a string and then you can use re.split to make a list of all the entries separated by times. Here's an example:
f = open(...)
allLines = ''.join(f.readlines())
entries = re.split(regex, allLines)
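Here is a hedged variation on that idea. It uses finditer with a lookahead rather than re.split, so that each timestamp stays attached to its entry; that swap is my own choice, not something from the answer above. It assumes every entry starts at the beginning of a line with something like "[8/21/13 ", and the sample log text is invented for the demonstration.

```python
import re

# Assumption: an entry starts at the beginning of a line with "[M/D/YY ".
entry_start = r"^\[\d{1,2}/\d{1,2}/\d{2} "

# One entry = from a start marker up to (not including) the next start
# marker, or the end of the text.
entry_rx = re.compile(
    entry_start + r".*?(?=" + entry_start + r"|\Z)",
    re.MULTILINE | re.DOTALL,
)

log = (
    "[8/21/13 11:30:33:557 PDT] first entry on one line\n"
    "[8/21/13 11:30:34:001 PDT] second entry select *\n"
    "from item (execution took 2083 milliseconds)\n"
    "[8/21/13 11:30:35:120 PDT] third entry\n"
)

# Collapse all whitespace, including embedded newlines, to single spaces.
entries = [" ".join(m.group(0).split()) for m in entry_rx.finditer(log)]
for entry in entries:
    print(entry)
```

Each element of entries is then a single-line string, which can be filtered further (for example on BMXAA6720W) with ordinary string or regex tests.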

delete only lines after match1 up to match2

I have checked and played with various examples and it appears that my problem is a bit more complex than what I have been able to find. What I need to do is search for a particular string and then delete the following line and keep deleting lines until another string is found. So an example would be the following:
a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z
Here, "color [" is match1, and "] #color" is match2. So then what is desired is the following:
a
b
color [
] #color
y
z
This "simple to follow" code example will get you started .. you can tweak it as needed. Note that it processes the file line-by-line, so this will work with any size file.
start_marker = 'startdel'
end_marker = 'enddel'
with open('data.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print line,
            ignoreLines = True
        if end_marker in line:
            ignoreLines = False
        if not ignoreLines:
            print line,
It uses startdel and enddel as "markers" for starting and ending the ignoring of data.
Update:
Modified code based on a request in the comments, this will now include/print the lines that contain the "markers".
Given this input data (borrowed from @drewk):
Beginning of the file...
stuff
startdel
delete this line
delete this line also
enddel
stuff as well
the rest of the file...
it yields:
Beginning of the file...
stuff
startdel
enddel
stuff as well
the rest of the file...
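The same line-by-line approach can be sketched in Python 3 as a small generator, here applied to the color example from the question. The function name drop_between is made up for illustration, and the sketch assumes the markers are not nested.

```python
def drop_between(lines, start_marker, end_marker):
    """Yield lines, skipping those strictly between the two marker lines."""
    skipping = False
    for line in lines:
        if skipping and end_marker in line:
            skipping = False          # the end marker itself is kept
        if not skipping:
            yield line
            if start_marker in line:
                skipping = True       # drop everything after the start marker


text = """a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z"""
result = list(drop_between(text.splitlines(), "color [", "] #color"))
print("\n".join(result))
```

Because it is a generator, it can also be fed an open file object directly, so the whole file never has to be held in memory.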
You can do this with a single regex by using nongreedy *. E.g., assuming you want to keep both the "look for this line" and the "until this line is found" lines, and discard only the lines in between, you could do:
>>> my_regex = re.compile("(look for this line)"+
...                       ".*?"+ # match as few chars as possible
...                       "(until this line is found)",
...                       re.DOTALL)
>>> new_str = my_regex.sub(r"\1\n\2", old_str)
A few notes:
The re.DOTALL flag tells Python that "." can match newlines -- by default it matches any character except a newline
The parentheses define "numbered match groups", which are then used later when I say r"\1\n\2" to make sure that we don't discard the first and last line (the \n restores the line break that .*? consumed after the first line). The replacement must be a raw string; otherwise Python reads "\1" and "\2" as control characters rather than group references. If you did want to discard either or both of those, then just get rid of the \1 and/or the \2. E.g., to keep the first but not the last use my_regex.sub(r"\1", old_str); or to get rid of both use my_regex.sub("", old_str)
For further explanation, see: http://docs.python.org/library/re.html or search for "non-greedy regular expression" in your favorite search engine.
This works:
s="""Beginning of the file...
stuff
look for this line
delete this line
delete this line also
until this line is found
stuff as well
the rest of the file... """
import re
print re.sub(r'(^look for this line$).*?(^until this line is found$)',
r'\1\n\2',s,count=1,flags=re.DOTALL | re.MULTILINE)
prints:
Beginning of the file...
stuff
look for this line
until this line is found
stuff as well
the rest of the file...
You can also use list slices to do this:
mStart='look for this line'
mStop='until this line is found'
li=s.split('\n')
print '\n'.join(li[0:li.index(mStart)+1]+li[li.index(mStop):])
Same output.
I like re for this (being a Perl guy at heart...)

Python - go to two lines above match

In a text file like this:
First Name last name #
secone name
Address Line 1
Address Line 2
Work Phone:
Home Phone:
Status:
First Name last name #
....same as above...
I need to match the string 'Work Phone:', then go two lines up and insert the character '|' at the beginning of that line. So the pseudo code would be:
if "Work Phone:" in line:
go up two lines:
write | + line
write rest of the lines.
File is about 10 mb and there are about 1000 paragraphs like this.
Then I need to write it to another file. So the desired result would be:
First Name last name #
secone name
|Address Line 1
Address Line 2
Work Phone:
Home Phone:
Status:
thanks for any help.
This solution doesn't read whole file into memory
p=""
q=""
for line in open("file"):
    line=line.rstrip()
    if "Work Phone" in line:
        p="|"+p
    if p: print p
    p,q=q,line
print p
print q
output
$ python test.py
First Name last name #
secone name
|Address Line 1
Address Line 2
Work Phone:
Home Phone:
Status:
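You can use this regex
((?:.*\n){2})(Work Phone:)
and replace the matches with
|\1\2
(The outer group is needed because a repeated group such as (.*\n){2} only keeps its last repetition, which would drop the first of the two lines.)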
You don't even need Python, you can do such a thing in any modern text editor, like Vim.
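If you do want to apply that substitution from Python, a minimal sketch could look like this. The sample text is the paragraph from the question; the variable names are made up, and for a 10 MB file this assumes reading the whole text into memory is acceptable.

```python
import re

text = """First Name last name #
secone name
Address Line 1
Address Line 2
Work Phone:
Home Phone:
Status:
"""

# Group 1 wraps BOTH preceding lines; the inner (?:...) only does the repeating.
pattern = re.compile(r"((?:.*\n){2})(Work Phone:)")

# Raw string for the replacement so \1 and \2 are group references.
marked = pattern.sub(r"|\1\2", text)
print(marked)
```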
Something like this?
lines = text.splitlines()
for i, line in enumerate(lines):
    if 'Work Phone:' in line:
        lines[i-2] = '|' + lines[i-2]
