Find all strings in text file fitting either of two formats


So I know similar questions have been asked before, but every method I have tried is not working...
Here is the ask: I have a text file (which is a log file) that I am parsing for any occurrence of "app.task2". The following are the 2 scenarios that can occur (As they appear in the text file, independent of my code):
Scenario 1:
Mar 23 10:28:24 dasd[116] <Notice>: app.task2.refresh:556A2D:[
{name: ApplicationPolicy, policyWeight: 50.000, response: {Decision: Can Proceed, Score: 0.45}}
] sumScores:68.785000, denominator:96.410000, FinalDecision: Can Proceed FinalScore: 0.713463}
Scenario 2:
Mar 23 10:35:56 dasd[116] <Notice>: 'app.task2.refresh:C6C2FE' CurrentScore: 0.636967, ThresholdScore: 0.410015 DecisionToRun:1
The problem is that my current code (below) does not capture the entire log entry in the first scenario. It only pulls the first line of the entry, not the remainder, and it appears to stop at the newline character that follows ":[".
My Code:
all = []
with open(path_to_log) as f:
    for line in f:
        if "app.task2" in line:
            all.append(line)
print all
How can I get the entire log entry for the first case? I tried stripping escape characters with no luck. From here I should be able to parse the list of results returned for what I truly need, but this will help! ty!
OF NOTE: I need to be able to locate these types of log entries (which will then give us either scenario 1 or scenario 2) by the string "app.task2". So this needs to be incorporated, like in my example...

Before adding the line to all, check if it ends with [. If it does, keep reading and merge the lines until you get to ].
import re
all = []
with open(path_to_log) as f:
    for line in f:
        if "app.task2" in line:
            if re.search(r'\[\s*$', line): # start of multiline log message
                for line2 in f:
                    line += line2
                    if re.search(r'^\s*\]', line2): # end of multiline log message
                        break
            all.append(line)
print(all)
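The nested-loop technique above works because both loops pull from the same iterator, so the inner loop resumes where the outer one stopped. Here is a minimal, self-contained sketch of that idea, using the two sample entries from the question. The function name `collect_entries`, the `needle` parameter, and the use of `io.StringIO` in place of a real log file are assumptions made for illustration.

```python
import io
import re

# The two scenarios from the question, used as a stand-in for the log file.
SAMPLE_LOG = (
    "Mar 23 10:28:24 dasd[116] <Notice>: app.task2.refresh:556A2D:[\n"
    "{name: ApplicationPolicy, policyWeight: 50.000, response: {Decision: Can Proceed, Score: 0.45}}\n"
    "] sumScores:68.785000, denominator:96.410000, FinalDecision: Can Proceed FinalScore: 0.713463}\n"
    "Mar 23 10:35:56 dasd[116] <Notice>: 'app.task2.refresh:C6C2FE' CurrentScore: 0.636967, ThresholdScore: 0.410015 DecisionToRun:1\n"
)

def collect_entries(lines, needle="app.task2"):
    """Return matching log entries, merging multi-line entries into one string."""
    lines = iter(lines)  # one shared iterator for both loops
    entries = []
    for line in lines:
        if needle not in line:
            continue
        if re.search(r"\[\s*$", line):       # entry continues on later lines
            for continuation in lines:        # resumes where the outer loop stopped
                line += continuation
                if re.search(r"^\s*\]", continuation):
                    break                     # closing "]" ends the entry
        entries.append(line)
    return entries

entries = collect_entries(io.StringIO(SAMPLE_LOG))
```

With the sample above, `entries` holds two strings: the first contains all three lines of scenario 1, the second is the single line of scenario 2.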

You are iterating over each line individually, which is why you only get the first line in scenario 1.
One option is to add a counter, assuming the number of continuation lines is fixed (two in scenario 1):
all = []
count = 0  # continuation lines still to merge into the last entry
with open(path_to_log) as f:
    for line in f:
        if count > 0:
            all[-1] += line  # append this line to the entry started earlier
            count -= 1
            continue
        if "app.task2" in line:
            all.append(line)
            if line.endswith('[\n'):
                count = 2  # scenario 1 has two more lines after ":["
print(all)
In this case I think Barmar's solution would work just as well.
Or, preferably, write a distinct delimiter between log entries when the log file is stored, and simply split the file on that delimiter.
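If the log format cannot be changed, the leading timestamp can play the role of the delimiter. The sketch below is one possible reading of that suggestion, not something taken from the answers: it assumes every entry starts with a timestamp like "Mar 23 10:28:24" and that continuation lines never do. The names `TIMESTAMP` and `split_entries` are hypothetical.

```python
import re

# Assumed entry prefix: month abbreviation, day, HH:MM:SS, then a space.
TIMESTAMP = re.compile(r"[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")

def split_entries(text):
    """Group lines into entries; a line starting with a timestamp opens a new entry."""
    entries = []
    for line in text.splitlines(keepends=True):
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += line  # continuation of the previous entry
    return entries

log_text = (
    "Mar 23 10:28:24 dasd[116] <Notice>: app.task2.refresh:556A2D:[\n"
    "{name: ApplicationPolicy, policyWeight: 50.000, response: {Decision: Can Proceed, Score: 0.45}}\n"
    "] sumScores:68.785000, denominator:96.410000, FinalDecision: Can Proceed FinalScore: 0.713463}\n"
    "Mar 23 10:35:56 dasd[116] <Notice>: 'app.task2.refresh:C6C2FE' CurrentScore: 0.636967, ThresholdScore: 0.410015 DecisionToRun:1\n"
)
matches = [entry for entry in split_entries(log_text) if "app.task2" in entry]
```

Filtering happens after grouping, so the search string only has to appear somewhere in the entry, not on its first line.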

I like @Barmar's solution with nested loops on the same file object, and may use that technique in the future. But before seeing it I would have done it with a single loop, which may or may not be more readable:
all = []
keep = False
for line in open(path_to_log,"rt"):
    if "app.task2" in line:
        all.append(line)
        keep = line.rstrip().endswith("[")
    elif keep:
        all.append(line)
        keep = not line.lstrip().startswith("]")
print (all)
Or you can print it more nicely with:
print(*all,sep='\n')

Related

Trying to skip over several lines but the skipped lines are still being worked on

My program takes an input file, reads the file using whitespace as the delimiter and puts the data into an array, then I want to iterate over each line and if certain strings are found write that info to another file.
When a specific string is found, I want to skip over several lines, meaning that these lines are NOT iterated over. I thought that if I increased the 'line' variable (i) that would do it, but despite the fact that i is increased by 50, those 50 lines are still being worked on, which is not what I want.
Hopefully I have explained this problem well. Thank you in advance for your feedback.
def create_outfile(infile):
    gto_found = 0
    outfile = "output.txt" # Output file
    outfile = open(outfile,'w') # Open output file for writing
    for i in range(len(infile)): # iterate over each line
        if len(infile[i]) == 6:
            if (infile[i][4][1:-1]) == "GTO" and gto_found == 0: # now skip
                print (i)
                print (infile[i])
                debugPause = input("\nPausing to debug...\n")
                i = i + 50 # Skip over the GTO section
                gto_found = 1
                print (i)
                debugPause = input("\nPausing to debug...\n")
                print (infile[i])
        for j in range(len(infile[i])): # iterate over each element
            # Command section
            if (infile[i][j])[:5] == "#ACS_":
                # Do some work

Unfortunately, a Python for loop cannot jump ahead like that. You can assign to i inside the loop body, but the next iteration overwrites it with the next value from range(), so the skip never takes effect. This is the same as this question here, so check it out. This other topic shows some workarounds that you could use.
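One common workaround is a while loop with a manually managed index. The sketch below first shows that reassigning the loop variable in a for loop has no lasting effect, then shows the while-loop version. The function `visit_with_skip`, its `skip` parameter, and the simplified "GTO" test are placeholders, not the asker's real parsing logic.

```python
# Reassigning the loop variable does not change what the for loop visits next.
seen = []
for i in range(5):
    seen.append(i)
    i += 3  # overwritten by the next value from range()

def visit_with_skip(rows, skip=50):
    """Return the indexes visited; jump `skip` rows ahead at the first row containing "GTO"."""
    visited = []
    gto_found = False
    i = 0
    while i < len(rows):
        if not gto_found and "GTO" in rows[i]:
            gto_found = True
            i += skip  # takes effect, because nothing resets i
            continue
        visited.append(i)
        i += 1
    return visited

rows = [["A"], ["GTO"], ["B"], ["C"], ["D"]]
visited = visit_with_skip(rows, skip=3)
```

Here `seen` ends up as `[0, 1, 2, 3, 4]`, while `visited` is `[0, 4]`: rows 1 through 3 are never processed.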

Error with .readlines()[n]

I'm a beginner with Python.
I tried to solve the problem: "If we have a file containing <1000 lines, how to print only the odd-numbered lines? ". That's my code:
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt')as f:
    n=1
    num_lines=sum(1 for line in f)
    while n<num_lines:
        if n/2!=0:
            a=f.readlines()[n]
            print(a)
            break
        n=n+2
where n is a counter and num_lines calculates how many lines the file contains.
But when I try to execute the code, it says:
"a=f.readlines()[n]
IndexError: list index out of range"
Why doesn't it recognize n as a counter?

You have put the call to readlines inside a loop, but that is not its intended use, because readlines ingests the whole file at once and returns a list of newline-terminated strings.
You may want to save such a list and operate on it:
list_of_lines = open(filename).readlines() # not closed explicitly; CPython closes the file when the object is garbage-collected
odd = 1
for line in list_of_lines:
    if odd : print(line, end='')
    odd = 1-odd
Two remarks:
odd alternates between 1 (true when used as an if condition) and 0 (false when used as an if condition),
the optional argument end='' to the print function is required because each line in list_of_lines ends with a newline character; if you omit it, print will output a second newline character after each line.
Coming back to your code, you can fix its behavior by calling
f.seek(0)
before the loop to rewind the file to its beginning, and by using the
f.readline() method (note: readline, not readlines) inside the loop,
but rest assured that proceeding like this is, let's say, a bit unconventional...
Finally, it is possible to do everything you want with a one-liner
print(''.join(open(filename).readlines()[::2]))
that uses the slice notation for lists and the string method .join()
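The same every-other-line selection can also be done lazily, without building the full list first. This is a small sketch using itertools.islice from the standard library; the helper name `odd_numbered_lines` and the in-memory sample are assumptions for illustration.

```python
import io
from itertools import islice

def odd_numbered_lines(lines):
    """Yield the 1st, 3rd, 5th, ... line without loading every line into a list."""
    return islice(lines, 0, None, 2)  # start at index 0, no stop, step 2

sample = io.StringIO("one\ntwo\nthree\nfour\nfive\n")  # stands in for an open file
result = list(odd_numbered_lines(sample))
```

With a real file, `for line in odd_numbered_lines(f): print(line, end='')` inside a `with open(...) as f:` block would print the odd-numbered lines in a single pass.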

Well, I'd personally do it like this:
def print_odd_lines(some_file):
    with open(some_file) as my_file:
        for index, each_line in enumerate(my_file): # keep track of the index of each line
            if index % 2 == 1: # check if index is odd (index is 0-based, so these are the 2nd, 4th, ... lines)
                print(each_line) # if it is, print it
if __name__ == '__main__':
    print_odd_lines(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') # raw string, so the backslashes are not treated as escapes
Be aware that this will leave a blank line in place of each skipped line, because print adds its own newline after the one already at the end of each line. I'm sure you can figure out how to get rid of it.

This code will do exactly as you asked:
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt')as f:
    for i, line in enumerate(f.readlines()): # Iterate over each line and add an index (i) to it.
        if i % 2 == 0: # i starts at 0 in python, so if i is even, the line is odd
            print(line)
To explain what happens in your code:
A file object keeps track of a read position. Once it reaches the end of the file, further reads return nothing until you rewind with f.seek(0) or close and reopen the file.
You first iterate over the entire file in num_lines=sum(1 for line in f). Now f is positioned at the end of the file.
When you then call f.readlines(), no lines are left to read, so it returns an empty list, and indexing it with [n] raises the IndexError. Even with a rewind, calling readlines() inside the loop would read the entire file each time. It is faster to go through it once (as in the solutions offered to your question).
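The effect of the read position can be reproduced in a few lines. The sketch below uses io.StringIO as a stand-in for the real file (an assumption made so the example is self-contained); a file opened in text mode behaves the same way here.

```python
import io

f = io.StringIO("a\nb\nc\n")       # stands in for the opened text file

num_lines = sum(1 for _ in f)       # consumes the stream; position is now at the end
after_count = f.readlines()         # nothing left: an empty list, so [n] would raise IndexError

f.seek(0)                           # rewind to the beginning
after_rewind = f.readlines()        # all lines are available again
```

After running this, `after_count` is empty while `after_rewind` contains all three lines, which is why the original code fails at `f.readlines()[n]`.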

As a fix, you can type
f.close()
f = open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt')
every time after you read through the file, in order to get back to the start.
As a side note, you should look up the modulus operator % for finding odd numbers: n/2!=0 is true for every n except 0, whereas n%2!=0 is true only for odd n.

Slice variable from specified letter to specified letter in line that varies in length

New to the site so I apologize if I format this incorrectly.
So I'm searching a file for lines containing
Server[x] ip.ip.ip.ip response=235ms accepted....
where x can be any number greater than or equal to 0, then storing that information in a variable named line.
I'm then printing this content to a tkinter GUI and it's way too much information for the window.
To resolve this I thought I would slice the information down with a return line[15:30] in the function but the info that I want off these lines does not always fall between 15 and 30.
To resolve this I tried to make a loop with
return line[cnt1:cnt2]
and checked cnt1 and cnt2 in a loop until cnt1 meets "S" and cnt2 meets the "a" from accepted.
The problem is that I'm new to Python and I can't get the loop to work.
def serverlist(count):
    try:
        with open("file.txt", "r") as f:
            searchlines = f.readlines()
        if 'f' in locals():
            for i, line in enumerate(reversed(searchlines)):
                cnt = 90
                if "Server["+str(count)+"]" in line:
                    if line[cnt] == "t":
                        cnt += 1
                    return line[29:cnt]
    except WindowsError as fileerror:
        print(fileerror)
I did a reversed on the line reading because the lines I am looking for repeats over and over every couple of minutes in the text file.
Originally I wanted to scan from the bottom and stop when it got to server[0] but this loop wasn't working for me either.
I gave up and started just running serverlist(count) and specifying the server number I was looking for instead of just running serverlist().
Hopefully when I understand the problem with my original loop I can fix this.
End goal here:
file.txt has multiple lines with
<timestamp/date> Server[x] ip.ip.ip.ip response=<time> accepted <unneeded garbage>
I want to cut just the Server[x] and the response time out of that line and show it somewhere else using a variable.
The line can range from Server[0] to Server[999] and the same response times are checked every few minutes so I need to avoid duplicates and only get the latest entries at the bottom of the log.
I'm sorry this is lengthy and confusing.
EDIT:
Here is what I keep thinking should work but it doesn't:
def serverlist():
    ips = []
    cnt = 0
    with open("file.txt", "r") as f:
        for line in reversed(f.readlines()):
            while cnt >= 0:
                if "Server["+str(cnt)+"]" in line:
                    ips.append(line.split()) # split on spaces
                    cnt += 1
    return ips
My test log file has server[4] through server[0]. I would think that the above would read from the bottom of the file, print server[4] line, then server[3] line, etc and stop when it hits 0. In theory this would keep it from reading every line in the file(runs faster) and it would give me only the latest data. BUT when I run this with while cnt >=0 it gets stuck in a loop and runs forever. If I run it with any other value like 1 or 2 then it returns a blank list []. I assume I am misunderstanding how this would work.

Here is my first approach:
def serverlist(count):
    with open("file.txt", "r") as f:
        for line in f.readlines():
            if "Server[" + str(count) + "]" in line:
                return line.split()[1] # split on spaces
    return False
print(serverlist(30))
# ip.ip.ip.ip
print(serverlist(";-)"))
# False
You can change the index in line.split()[1] to get the specific space-separated string of the line.
Edit: Sure, just remove the if condition to get all IPs:
def serverlist():
    ips = []
    with open("file.txt", "r") as f:
        for line in f.readlines():
            if line.strip().startswith("Server["):
                ips.append(line.split()[1]) # split on spaces
    return ips
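For the stated end goal (only the Server[x] number and the response time, latest entry per server), a regular expression may be easier than fixed slice positions. The following is a sketch under assumptions: the pattern `ENTRY` is inferred from the line layout described in the question, the function name `latest_responses` is hypothetical, and the sample timestamps and addresses are made up for the demonstration.

```python
import re

# Assumed layout: "... Server[x] <ip> response=<time> accepted ..."
ENTRY = re.compile(r"Server\[(\d+)\]\s+\S+\s+response=(\S+)")

def latest_responses(lines):
    """Map each server number to its most recent response time."""
    latest = {}
    for line in reversed(list(lines)):      # newest entries are at the bottom
        match = ENTRY.search(line)
        if match:
            server = int(match.group(1))
            # setdefault keeps the first value seen, i.e. the one nearest the bottom
            latest.setdefault(server, match.group(2))
    return latest

# Made-up sample lines in the described format.
log_lines = [
    "t1 Server[0] 10.0.0.1 response=235ms accepted foo\n",
    "t1 Server[1] 10.0.0.2 response=80ms accepted foo\n",
    "t2 Server[0] 10.0.0.1 response=190ms accepted foo\n",
    "t2 Server[1] 10.0.0.2 response=95ms accepted foo\n",
]
latest = latest_responses(log_lines)
```

Because the file is scanned from the bottom and only the first hit per server is kept, repeated checks of the same server do not produce duplicates.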

Adding non-duplicate strings from one txt to another in Python3.3

I have 2 text files (new.txt and master.txt). Each has different data stored as such:
Cory 12 12:40:12.016221
Suzy 64 12:40:33.404614
Trent 145 12:40:56.640052
(categorised by the first set of numbers appearing on each line)
I have to scan each line of new.txt for the name (e.g. Suzy), check if there is a duplicate in master.txt and, if there isn't, add that line to master.txt, categorized by that line's number (e.g. 64 in Suzy 64 12:40:33.404614).
I have written the following script, but it falls into a loop of checking the 1st line of new.txt (I know why, I just don't know how to work around not closing fileinput.input(new.txt) so that I can then open fileinput.input(master.txt) further down the loop). I feel like I've highly over complicated things for myself and any help is appreciated.
import fileinput
import re
end_of_file = False
while end_of_file == False:
    for line in fileinput.input('new.txt', inplace=1):
        end_of_file = fileinput.isstdin() #ends while loop if on last line of new.txt
        user_f_line_list = line.split()
        master_f = open('master.txt', 'r')
        master_f_read = master_f.read()
        master_f.close()
        fileinput.close()
        if not re.findall(user_f_line_list[0], master_f_read):
            for line in fileinput.input('master.txt', inplace=1):
                master_line_list = line.split()
                if int(user_f_line_list[1]) <= int(master_line_list[1]):
                    written = False
                    while written == False:
                        written = True
                        print(' '.join(user_f_line_list))
                print(line, end='')
            fileinput.close()
And for reference, master.txt starts with startline 0 and ends with endline 1000000000000000 so that it is impossible for the categorizing to be out of range.

Some suggestions:
Open master.txt into a list with readlines().
Use an OrderedDict from the collections module - it is the same as a regular dict but preserves the order. Make each key the unique element - a tuple in this case (e.g. ("Cory", 12)). Make the value whatever comes after.
Now you can very rapidly check to see if the entry is present by if key in my_dict:.
If it isn't, you can insert it. If you need to insert in order, it'll take a bit more work, but not too much. I would insert in the end, convert to a list when all is done, and apply a sort function to the list with a custom function to specify how to sort.
Output it back to the file.
I won't say it's necessarily shorter than your solution, but it is a lot cleaner.
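The steps above can be sketched roughly as follows. This is a simplified variant rather than the exact suggestion: it tracks names in a plain set instead of an OrderedDict, appends unseen lines, and sorts once at the end by the number in the second field. The function name `merge_lines` is hypothetical, and reading from and writing back to master.txt is left out so the example stays self-contained.

```python
def merge_lines(master_lines, new_lines):
    """Return master lines plus lines from new_lines whose name is not yet present,
    sorted by the integer in the second field."""
    merged = [line.rstrip("\n") for line in master_lines if line.strip()]
    seen = {line.split()[0] for line in merged}   # names already in master
    for line in new_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] not in seen:
            seen.add(fields[0])
            merged.append(line.rstrip("\n"))
    merged.sort(key=lambda entry: int(entry.split()[1]))  # stable sort on the number
    return merged

master = ["startline 0\n", "Cory 12 12:40:12.016221\n", "endline 1000000000000000\n"]
new = ["Suzy 64 12:40:33.404614\n", "Cory 12 12:41:00.000000\n"]
merged = merge_lines(master, new)
```

In this run Suzy is inserted between Cory and endline, and the second Cory line is ignored because the name already exists. Writing the result back would be a matter of joining `merged` with newlines.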

pulling subset of lines from files

I have files where there is a varying number of header lines in random order, followed by the data that I need, which spans the number of lines given by the corresponding header (e.g. Lines: 3):
from: blah@blah.com
Subject: foobarhah
Lines: 3
Extra: More random stuff
Foo Bar Lines of Data, which take up
some arbitrary long amount characters on a single line, but no matter how long
they still only take up the number of lines as specified in the header
How can I get at that data in one read of the file??
P.S. The data is from the 20Newsgroups corpus.
Edit: The quick solution I guess which only works if I relax the constraint on reading only once is this:
[1st read] Find out total_num_of_lines and match on first Lines: header ,
[2nd read] I discard the first (total_num_of_lines- header_num_of_lines) and then read the rest of the file
I'm still unaware of a way to read in the data in one pass though.

I'm not quite sure you even need the beginning of the file in order to get its contents. Consider using split:
_, contents = file_contents.split("\n\n", 1) # headers end at the first blank line; text mode yields "\n" line endings
If, however, the lines parameter does count, you can use the technique suggested above along with parsing the file headers:
headers, contents = file_contents.split("\n\n", 1)
# Get lines length
headers_list = [line.split() for line in headers.splitlines()]
lines_count = int([line[1] for line in headers_list if line[0].lower() == 'lines:'][0])
# Get contents (the first lines_count lines of the body)
real_contents = contents.splitlines()[:lines_count]

Assuming we have the general case where there could be multiple messages following each other, maybe something like
from itertools import takewhile
def msgreader(file):
    while True:
        header = list(takewhile(lambda x: x.strip(), file))
        if not header: break
        header_dict = {k: v.strip() for k,v in (line.split(":", 1) for line in header)}
        line_count = int(header_dict['Lines'])
        message = [next(file) for i in range(line_count)] # or islice..
        yield message
would work, where
with open("53903") as fp:
    for message in msgreader(fp):
        print(message)
would give all the listed messages. For this particular use case the above would be overkill, but frankly it's not much harder to extract all the header info than it is only the one line. I'd be surprised if there weren't already a module to parse these messages, though.

You need to store the state of whether the headers have finished. That's all.
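A minimal sketch of that state-tracking idea is shown below. It assumes the headers are separated from the data by a blank line and that a single message is being read; the function name `read_body` and the sample lines are made up for illustration.

```python
def read_body(lines):
    """Return the body lines announced by the 'Lines:' header, in a single pass."""
    in_headers = True   # state: are we still reading headers?
    remaining = 0       # body lines still to collect
    body = []
    for line in lines:
        if in_headers:
            if line.lower().startswith("lines:"):
                remaining = int(line.split(":", 1)[1])
            elif not line.strip():      # blank line ends the headers
                in_headers = False
        elif remaining > 0:
            body.append(line)
            remaining -= 1
    return body

sample = ["from: a\n", "Lines: 2\n", "Extra: x\n", "\n", "first\n", "second\n", "trailing\n"]
body = read_body(sample)
```

The header order does not matter here, because the count is only used once the blank line has switched the state.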
