Python script to extract test logs from log files - python

The problem statement is as follows. There is a log file that contains logs of test results. For example, it contains text like testcase1 followed by the logs for that test case, then testcase2 followed by its logs, and so on.
If the user wants to extract the logs for testcase1 and testcase3, the script should read input from the user, such as testcase1 and testcase3, and then extract only the logs for the specified test cases.
In this case, assuming the user enters testcase1 and testcase3, the output should be the lines below testcase1 and testcase3 in the log file.

This assumes that every log entry is on its own line, i.e. each log line ends with '\n'.
You can then read the file with the linecache module.
Your logs also need to follow a fixed format. You have described it as [testcaseN] [Log message], where 'testcase' is constant and 'N' varies.
So, as you fetch the lines with linecache, use the re module to match the testcaseN given as input against the first word of each line. When a line matches, display it.
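A minimal sketch of the matching step described above, assuming every log line starts with its test case name, optionally in square brackets. The function name extract_testcase_lines and the sample lines are made up for illustration; the lines could equally come from linecache.getlines(path) or from iterating over an open file.

```python
import re

# Assumption: the first word of each line is "testcaseN" or "[testcaseN]".
FIRST_WORD = re.compile(r"^\[?(testcase\d+)\b", re.IGNORECASE)


def extract_testcase_lines(lines, wanted):
    """Return the lines whose first word is one of the wanted test cases."""
    wanted = {name.lower() for name in wanted}
    selected = []
    for line in lines:
        match = FIRST_WORD.match(line.strip())
        if match and match.group(1).lower() in wanted:
            selected.append(line.rstrip("\n"))
    return selected


sample_lines = [
    "[testcase1] started\n",
    "[testcase2] skipped\n",
    "[testcase3] passed\n",
    "[testcase10] unrelated\n",
]
for line in extract_testcase_lines(sample_lines, ["testcase1", "testcase3"]):
    print(line)
```

The \b after the digits keeps testcase1 from also matching testcase10.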

Finally got a working script to extract the text:
# Read particular lines from the text file
# For example, this program reads the file and prints only the lines under Title 2 and Title 4
# The log file may contain empty lines as well
# Example log file
# Title
# 1
# dklfjsdkl;
# g
# sdfzsdfsdf
# sdfsdfsdf
# dsfsdfsd
# dfsdf
#
# title
# 2
#
# dfdf
# dfdf
# dfdf
# df
# dfd
# d
#
# title3
# sdfdfd
# dfdfd
# dfd
#
# dfd
#
# title
# 4
# dfkdfkd
# dfdkjmd
# dfdkljm
in_list = []
while True:
    i = input("Enter title to be extracted (or Enter to quit): ")
    if not i:
        break
    # the file content is lower-cased below, so lower-case the input too
    in_list.append(i.lower())
    print("Your input:", i)
print("While loop has exited")
print("Input list", in_list)
with open("C:\\text.txt", 'r') as inp:
    # read the file into a list, lower-casing every line and
    # stripping the trailing "\n" that each line carries
    flist = [s.strip("\n").lower() for s in inp]
print(flist)
with open("C:\\output.txt", 'a') as f1:
    for title in in_list:
        if title not in flist:
            continue
        i_index = flist.index(title)
        f1.write(flist[i_index] + "\n")
        i_index += 1
        # copy lines until the next title or the end of the file
        while i_index < len(flist) and "title" not in flist[i_index]:
            f1.write(flist[i_index] + "\n")
            i_index += 1

Related

Using python and regex to find strings and put them together to replace the filename of a .pdf - the rename fails when using more than one group

I have several thousand pdfs which I need to re-name based on the content. The layouts of the pdfs are inconsistent. To re-name them I need to locate a specific string "MEMBER". I need the value after the string "MEMBER" and the values from the two lines above MEMBER, which are Time and Date values respectively.
So:
STUFF
STUFF
STUFF
DD/MM/YY
HH:MM:SS
MEMBER ######
STUFF
STUFF
STUFF
I have been using regex101.com and have ((.*(\n|\r|\r\n)){2})(MEMBER.\S+) which matches all of the values I need. But it puts them across four groups with group 3 just showing a carriage return.
What I have so far looks like this:
import fitz
from os import DirEntry, curdir, chdir, getcwd, rename
from glob import glob as glob
import re
failed_pdfs = []
count = 0
pdf_regex = r'((.*(\n|\r|\r\n)){2})(MEMBER.\S+)'
text = ""
get_curr = getcwd()
directory = 'PDF_FILES'
chdir(directory)
pdf_list = glob('*.pdf')
for pdf in pdf_list:
    with fitz.open(pdf) as pdf_obj:
        for page in pdf_obj:
            text += page.get_text()
    new_file_name = re.search(pdf_regex, text).group().strip().replace(":","").replace("-","") + '.pdf'
    text = ""
    #clean_new_file_name = new_file_name.translate({(":"): None})
    print(new_file_name)
    # Tries to rename a pdf. If the filename doesn't already exist
    # then rename. If it does exist then throw an error and add to failed list
    try:
        rename(pdf, new_file_name )
    except WindowsError:
        count += 1
        failed_pdfs.append(str(count) + ' - FAILED TO RENAME: [' + pdf + " ----> " + str(new_file_name) + "]")
If I specify a group in the re.search portion- Like for instance Group 4 which contains the MEMBER ##### value, then the file renames successfully with just that value. Similarly, Group 2 renames with the TIME value. I think the multiple lines are preventing it from using all of the values I need. When I run it with group(), the print value shows as
DATE
TIME
MEMBER ######.pdf
And the log count reflects the failures.
I am very new at this, and stumbling around trying to piece together how things work. Is the issue with how I built the regex or with the re.search portion? Or everything?
I have tried re-doing the Regular Expression, but I end up with multiple lines in the results, which seems connected to the rename failure.
The strategy is to read the page's text by words and sort them adequately.
If we then find "MEMBER", the word following it is the ###### value, and the two preceding words must be the date and the time respectively.
found = False
for page in pdf_obj:
    words = page.get_text("words", sort=True)
    # all space-delimited strings on the page, sorted vertically,
    # then horizontally
    for i, word in enumerate(words):
        if word[4] == "MEMBER":
            hashes = words[i+1][4]  # this must be the word after MEMBER!
            time_string = words[i-1][4]  # the time
            date_string = words[i-2][4]  # the date
            found = True
            break
    if found == True:  # no need to look in other pages
        break
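If you would rather stay with a regular expression, the rename most likely fails because .group() returns the whole match, including the line breaks between the three values, and a file name cannot contain line breaks (nor "/" or ":" on Windows). The sketch below uses named groups and joins the three values into a single line. The names PDF_REGEX and build_file_name, the "_" separator, and the sample text are illustrative assumptions, not part of the original code.

```python
import re

# Named groups: the date line, the time line, then the value after MEMBER.
PDF_REGEX = re.compile(
    r"(?P<date>[^\r\n]*)\r?\n(?P<time>[^\r\n]*)\r?\n\s*MEMBER\s+(?P<member>\S+)"
)


def build_file_name(text):
    """Return a one-line file name from date, time and member, or None."""
    match = PDF_REGEX.search(text)
    if match is None:
        return None
    parts = [match["date"], match["time"], match["member"]]
    # Remove whitespace and characters that Windows forbids in file names.
    cleaned = [re.sub(r'[\\/:*?"<>|\s]', "", part) for part in parts]
    return "_".join(cleaned) + ".pdf"


sample = "STUFF\n01/02/23\n10:11:12\nMEMBER 123456\nSTUFF\n"
print(build_file_name(sample))  # 010223_101112_123456.pdf
```

Checking for None before renaming also avoids the AttributeError that re.search(...).group() raises when a PDF does not contain the pattern.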

How can I remove double quotations surrounding a list imported from a file in Python?

I'm trying to create a food storage application that tracks the items put in a food storage facility and recalls them for the user. This is a very early prototype of the program, only capable of tracking the information locally, but I ran into some problems with it. So my apologies if this code is unreadable, I just can't find a solution to my problem explained below.
print("Food Storage Application V1.0 - LOCAL ONLY")
UPC_List = []
with open('UPCList.txt', 'r') as file:
    for UPCentry in file:
        location = UPCentry[:-1]
        UPC_List.append(location)
print(UPC_List)
global i
i = 0
UPC_Input = ''
UPC_Count = 0
while True:
    UPC_Found = False
    UPC_Input = input("Enter UPC or enter 'end' to quit: ")
    if UPC_Input == "end":
        with open("UPCList.txt", "w") as file:
            for UPCsave in UPC_List:
                file.write('%s\n' % UPCsave)
        break
    try:
        UPC_Input = int(UPC_Input)
    except ValueError as v:
        print(f"Input '{UPC_Input}' is not an acceptable UPC")
        continue
    # print(UPC_List) # for debugging
    def newProduct(UPC):
        global UPC_Count
        product_name = input(f"Enter name of item {UPC}: ")
        product_quantity = input(f"Enter quantity of item {UPC}: ")
        try:
            product_quantity = int(product_quantity)
        except ValueError as v:
            print("Invalid quantity. Please enter a number.")
            newProduct(UPC_Input)
        product_unit = input(f"Enter unit type (box, bunch, can, etc...) of item {UPC}: ")
        print(f"You have added: \n {product_name} \n {UPC} \n Quantity: {product_quantity} \n Unit: {product_unit}")
        UPC_List.insert(UPC_Count, [UPC, product_name, product_quantity, product_unit])
        UPC_Count += 1
    def existingProduct(UPC):
        for sublist in UPC_List:
            if str(UPC) in str(sublist):
                UPC = int(UPC)
                print(f"Position: {UPC_List.index(sublist)} {sublist.index(UPC)}")
                position = UPC_List.index(sublist)
                addition = input(f"Enter the number of items to add to '{UPC_List[position][1]}' (Default entry: +1): ")
                try:
                    addition = int(addition)
                except ValueError as v:
                    addition = 0
                if addition == 0:
                    UPC_List[position][2] += 1
                else:
                    UPC_List[position][2] += addition
                print(f"New Quantity for item '{UPC_List[position][1]}': {UPC_List[position][2]}")
    #Find if UPC Exists
    for UPC in UPC_List:
        if UPC[0] == UPC_Input:
            print("UPC Found")
            existingProduct(UPC_Input)
            UPC_Found = True
    if UPC_Found == False:
        newProduct(UPC_Input)
This is my code so far. I made a version of it without the read and writing to file lines and it worked great, but I'm stumped on getting the code to read a list from a file and use it in the code. It saves the list, but it won't retrieve it correctly. I found what I think is the problem by using that print(UPC_List) line, which prints ["[2, 'banana', 2, 'bunch']"] (that was a test entry I loaded into the file using the program). I think the problem lies in the double quotes on the outside of the list. This is a nested list, so those quotation marks lead to an index error when I try to access the list.
If this isn't enough info, I can try to provide more. I'm very new to python and coding in general so this was my best attempt at the script.
You are reading each list in as a string.
You can use the python eval function to convert a string into its evaluated form:
my_list_string = "['item1', 'item2']"
my_list = eval(my_list_string)
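Note that eval executes whatever expression the file contains, so it should only be used on files you fully trust. A more cautious variant of the same idea, sketched here, is ast.literal_eval from the standard library, which only accepts Python literals such as lists, strings and numbers. The function name parse_saved_lines is made up for this example.

```python
import ast


def parse_saved_lines(lines):
    """Turn lines like "[2, 'banana', 2, 'bunch']" back into real lists."""
    items = []
    for line in lines:
        line = line.strip()
        if line:  # ignore blank lines
            items.append(ast.literal_eval(line))
    return items


saved = ["[2, 'banana', 2, 'bunch']\n"]
print(parse_saved_lines(saved))  # [[2, 'banana', 2, 'bunch']]
```

The result is a nested list again, so an access like UPC_List[position][2] works as it did before the data was written to the file.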
To reproduce and solve, we need a minimal reproducible example including
the input (file contents posted as plain-text in code-block formatting)
the minimal code to reproduce
Input file
Contents of UPCList.txt:
[2, 'banana', 2, 'bunch']
This is the string-representation of a list in Python (like str(list)). Probably it was written by your program.
Code
a minimal reproducible example (only first few lines to print the read list):
upc_locations = []
with open('UPCList.txt', 'r') as file:
    for upc in file:
        location = upc[:-1]
        upc_locations.append(location)
print(upc_locations)
Prints:
["[2, 'banana', 2, 'bunch']"]
Debug
add some debug print for each line read
upc_locations = []
with open('UPCList.txt', 'r') as file:
    for line in file:
        print(line)
        location = line[:-1]  # read all chars from line until last (excluding)
        print(location)
        upc_locations.append(location)
print(upc_locations)
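Prints:
[2, 'banana', 2, 'bunch']

[2, 'banana', 2, 'bunch']
["[2, 'banana', 2, 'bunch']"]
Note:
the first line is the raw line from the file; it looks like a Python list of strings and numbers
the empty second line is the line-break \n at the end of that line, which print(line) outputs as well
the third line is the same text with the line-break removed
the fourth line shows that the list holds this text as a single string, not as a nested list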
Fix
Each line can be parsed as a JSON array once the single quotes are replaced by double quotes. This simple replacement assumes that no item name contains a quote character itself.
import json
upc_locations = []
with open('UPCList.txt', 'r') as file:
    for line in file:
        cleaned = line.strip()  # remove the line-break and any surrounding whitespace
        print(cleaned)
        valid_json = cleaned.replace("'", '"')  # replace single quotes by double quotes to have valid JSON strings
        array = json.loads(valid_json)
        print(array)
        upc_locations.append(array)  # keep each line as one nested list
print(upc_locations)
Prints (with Python 3):
[2, 'banana', 2, 'bunch']
[2, 'banana', 2, 'bunch']
[[2, 'banana', 2, 'bunch']]
Tip: save your objects as JSON to a file
When saving objects from your programs to a plain text-file it is recommended to use a standard format like CSV, XML or JSON.
So you can use standard-parsers to read it (back).
For example:
import json
def save(items):
    with open("UPCList.txt", "w") as file:
        json.dump(items, file)
        # file.write('%s\n' % UPCsave)
def load():
    with open("UPCList.txt", "r") as file:
        return json.load(file)
Note:
The commented-out line wrote each list in Python's string representation (see percent-formatting with %s). That is why the single quotes had to be replaced by double quotes when reading.
json.dump writes a list as a JSON array. This format is readable by many programs and tools and is the de facto standard on the web.
See also:
Write JSON objects to file using python
Reading JSON from a file?

How to sort a CSV file's column without regard for character casing? [duplicate]

I have a program that randomly generates a series of 5 letters (ASCII, both upper and lower case) in column 1 of a CSV and 4 numbers (0-9) in column 2 of the same CSV. I can sort column 2 in order of ascending values but struggle with column 1 as it sorts all the uppercase values first and then lower case. This is also output to a new file, sorted.csv.
Example:
ANcPI
DLBvA
FpSCo
beMhy
dWDjl
Does anyone know how to sort these so that casing does not have an impact but rather just the letter? It should sort as:
ANcPI
beMhy
DLBvA
dWDjl
FpSCo
Here is the code for the program as it currently stands:
import random
import string
#se='111411468' # Hardcoded value for development of the program
se = int(input('Please enter your ID for specific random number sequence: ')) # Asks user for ID number to generate files based on, uses se as above
random.seed(se,2) # Use of 2 in random.seed handles strings, numbers and others
ascii_lowercase = 'abcdefghijklmnopqrstuvwxyz' # Uses all lower case ASCII
ascii_uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' # Uses all upper case ASCII
ascii_letters = ascii_lowercase + ascii_uppercase # Combines all ASCII characters giving a higher degree of randomness to the generated files
digits = '0123456789' # Range of digit values for column 2 of 'unsorted.csv'
def rubbish_string(selection,l):
    s = ''
    for c in range(l):
        s += selection[random.randint(0,len(selection)-1)]
    return s
def writeRandomFile(filename):
    with open(filename, 'w') as fout: # With automatically closes the file on finishing its extent,even with exceptions
        fout.write('Personal ID,'+str(se)+'\n') # Write an assembled string, personal ID in grid A1 and personal ID No. in grid B1.....headings for the data
        for xc in range(random.randint(10,20)):
            fout.write(rubbish_string(ascii_letters,5)+','+rubbish_string(digits,4)+'\n') # Assemble and write a line to the file using 5 from the ascii_letters and 4 from the digits variable
def selectionSort (alist, col): # Slightly modified function for selection sort from part Q7A
    for fillslot in range(len(alist)-1,0,-1):
        positionOfMax=0 # Initially the maximum value is positioned at 0, Binary so position 1
        for location in range(1, fillslot+1):
            # Results in a list of size 2, first part is letters, second part is numbers
            line1 = alist[location].split(",") # Makes sense to use "," for a .csv
            line2 = alist[positionOfMax].split(",")
            # Column-1 because computers count from zero (Binary)
            # When the user enters 1 or 2, the computer deals with it in terms of 0 or 1
            if line1[col - 1] > line2[col - 1]: # Deals with the Binary issue by taking 1 from the input column value from the user
                positionOfMax=location
        temp= alist[fillslot]
        alist[fillslot]=alist[positionOfMax]
        alist[positionOfMax]=temp
# Main part...
# Create random file based on the user data
writeRandomFile('unsorted.csv')
# Allow user pick which column to sort on, either 1 or 2 (could be adapted for more columns)
sortOnColumn = -1
while sortOnColumn != 1 and sortOnColumn != 2: # If the user does not enter an appropriate value, in this case 1 or 2 they are re-asked to input the column based on which the sorting will be done.
    sortOnColumn = int(input("Which column should the files be sorted on?"))
# Open unsorted file and load into a list
fin = open('unsorted.csv', 'r') # Opens a file for reading, called fin
data = [] # Creates an empty list named data
header = next(fin) # Skip first line because its Personal ID data, not random data generated by the program
for line in fin: # Loops through lines in fin
    data.append(line.strip()) # Adds the line to the list, appending the list[]
selectionSort(data, sortOnColumn) # Sort list with the selection sort algorithm, calling the function, where data=the list and sortOnColum=user choice
# Write the now sorted list to a file, called fout
fout = open('sorted.csv', 'w') # Opening the empty sort.csv in write mode
fout.write(header) # Write PersonID data first at the top of the .csv as in the unsorted format
for entry in data: # Write ordered data
    fout.write(entry)
    data.sort(key=lambda m : m.lower()) # Sorts col.1 based on non case sensitive letters but issues with col.2..............
    fout.write("\n") # Formatting with "\n"
fout.close() # Close the file so not to have generated just a blank .csv with no data
print('Your Files Have Been Generated, Goodbye!')
You have a lot of errors in your code as shown on PEP8 online. If you can, use either PyCharm Community or PyCharm Edu in the future so that your code is automatically checked while writing it. Here is a revised version of your code:
import csv
import random
import string
import sys
UNSORTED_FILENAME = 'unsorted.csv'
SORTED_FILENAME = 'sorted.csv'
def main():
    """This function handles execution of the entire program."""
    # seed = setup_random_sequence(111411468)
    seed = setup_random_sequence()
    write_random_file(UNSORTED_FILENAME, seed)
    header, data = load_and_sort(UNSORTED_FILENAME)
    write_csv_file(SORTED_FILENAME, header, data)
    print('Your files have been generated. Goodbye!')
def setup_random_sequence(seed=None):
    """Seed the random number generator with some specific value."""
    if seed is None:
        seed = get_int('Enter your ID for your random number sequence: ')
    random.seed(seed, 2)
    return seed
def get_int(prompt='>>> '):
    """Get an integer from the user and return it to the function's caller."""
    while True:
        try:
            text = input(prompt)
        except EOFError:
            sys.exit()
        else:
            try:
                return int(text)
            except ValueError:
                print('You must enter an integer.')
def write_random_file(filename, seed):
    """Create random file based on the user data."""
    rows = ((rubbish_string(string.ascii_letters, 5),
             rubbish_string(string.digits, 4))
            for _ in range(random.randint(10, 20)))
    write_csv_file(filename, ('Personal ID', seed), rows)
def write_csv_file(filename, header, data):
    """Write the now sorted list to a file."""
    with open(filename, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(header)
        writer.writerows(data)
def rubbish_string(selection, length):
    """Create a string of given length built of characters from selection."""
    return ''.join(random.choice(selection) for _ in range(length))
def load_and_sort(filename):
    """Load the file given by filename and sort by user specification."""
    sort_on_column = None
    while True:
        sort_on_column = get_int('Which column should file be sorted on? ')
        if sort_on_column in {1, 2}:
            break
        print('Column number is out of range.')
    with open(filename, newline='') as file:
        reader = csv.reader(file)
        header = next(reader)
        data = list(reader)
    selection_sort(data, sort_on_column)
    return header, data
def selection_sort(array, column):
    """This is a modified function for selection sort from part Q7A."""
    array.sort(key=lambda row: row[column - 1].casefold())
if __name__ == '__main__':
    main()
As you might notice on PEP8 online, the code no longer has any errors. If you need any further changes, please let me know.
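The revised selection_sort above delegates to list.sort. If the exercise requires keeping the hand-written selection sort, the same case-insensitive comparison can be applied inside it. The following is a sketch under that assumption; the key parameter and the sample rows (taken from the question's example) are illustrative.

```python
def selection_sort(rows, column, key=str.casefold):
    """Sort rows in place on a 1-based column, ignoring letter case."""
    for fill_slot in range(len(rows) - 1, 0, -1):
        position_of_max = 0
        for location in range(1, fill_slot + 1):
            current = key(rows[location][column - 1])
            largest = key(rows[position_of_max][column - 1])
            if current > largest:
                position_of_max = location
        # Move the largest remaining row to the end of the unsorted part.
        rows[fill_slot], rows[position_of_max] = (
            rows[position_of_max],
            rows[fill_slot],
        )


rows = [["ANcPI"], ["DLBvA"], ["FpSCo"], ["beMhy"], ["dWDjl"]]
selection_sort(rows, 1)
print([row[0] for row in rows])
# ['ANcPI', 'beMhy', 'DLBvA', 'dWDjl', 'FpSCo']
```

Only the comparison changes: both values go through str.casefold before they are compared, so the stored data keeps its original casing.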

Unable to create a dictionary

I am trying to write a script for log parsing.
I have a file in which the logs are jumbled up. The first line of each log set has a time stamp, so I want to sort the sets using that.
For example:
10:48 Start
.
.
10:50 start
.
.
10:42 start
First line will contain key word ‘Start’ and the time stamp. The lines between ‘Start’ and before next ‘start’ are one set. I want to sort all of these sets in log files based on their time stamp.
Code Logic:
I thought of creating dictionary, where I will pick this time and assign it as ‘key’ and the text in value for that log set. And then I will sort the ‘keys’ in dictionary and print their ‘values’ in that sorted order in a file.
However I am getting error “TypeError: unhashable type: 'list'”
write1 = False
x = 0
search3 = "start"
matched = dict()
matched = {}
# fo is a list which is defined elsewhere in the code.
for line in fo:
    if search3 in line:
        #got the Hello2 printed which indicates script enters this loop
        print('hello2')
        write1 = True
        x +=1
        time = line.split()[3]
        name1 = [time.split(':')[0] +":"+time.split(':')[1] + ":"+ time.split(':')[2]]
        matched[name1] = line
    elif write1:
        matched[name1] += line
print(matched.keys())
Please let me know if my logic and the way I am doing is correct?
You set name1 as a list. Lists aren't hashable, so they can't be used as dictionary keys; immutable values such as strings and tuples can. I assume that you want name1 to be a string, so you just need to remove the brackets:
name1 = time.split(':')[0] +":"+time.split(':')[1] + ":"+ time.split(':')[2]
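With a string key the dictionary approach works, but two sets that share a time stamp would overwrite each other. As a sketch of the overall idea, the version below collects the sets in a list of (time stamp, lines) pairs and sorts that list instead. It assumes the layout from the example in the question, where the time stamp is the first word of the line that contains "start"; the names START_RE and sort_log_blocks are made up for illustration.

```python
import re

# Assumption: a set begins on a line like "10:48 Start" or "10:48:05 start".
START_RE = re.compile(r"^(\d{1,2}:\d{2}(?::\d{2})?)\b.*start", re.IGNORECASE)


def sort_log_blocks(lines):
    """Group lines into sets beginning at a start line; sort sets by time."""
    blocks = []  # list of (time_key, [lines of the set])
    for line in lines:
        match = START_RE.match(line)
        if match:
            time_key = tuple(int(part) for part in match.group(1).split(":"))
            blocks.append((time_key, [line]))
        elif blocks:
            blocks[-1][1].append(line)
    blocks.sort(key=lambda block: block[0])  # stable for equal time stamps
    return [line for _, block in blocks for line in block]


log = ["10:48 Start", "a", "10:50 start", "b", "10:42 start", "c"]
print(sort_log_blocks(log))
# ['10:42 start', 'c', '10:48 Start', 'a', '10:50 start', 'b']
```

Converting the time stamp to a tuple of integers makes 9:05 sort before 10:42, which a plain string comparison would not do.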

Finding a matching keyword from a text file using python 3.x

I have a text file with many lines in the following format:
Main keyword:Mainvariable, variable (1):name_1, variable (2, 3):name_2, variable(3, 1, 2):name_3, and so on....
After checking that the main keyword exists in a particular line, I want to retrieve the corresponding information from that line.
For example, if another file has variable (1) on a line, I want to get name_1 as the answer.
Input file
variable (1)
variable (3,1,2)
variable (1) and so on...
Required output:
name_1
name_3
name_1 and so on..
code till now:
print("\n-- ACCESSED VARIABLES --\n")
with open(commands_file, "r") as find_feature_access:
    with open("commands_analysis/feature_commands", "a") as feature_commands:
        with open("common_files/commands_library", "r") as feature_names:
            for line_1 in find_feature_access:
                current_line = line_1.strip()
                if current_line in possible_variables:
                    feature_commands.write(current_line + "\n")
                    for line_2 in feature_names:
                        if current_line in line_2:
                            line_part = line_2.strip(",")
                            if current_line in line_part:
                                required_part = line_part.strip(",")
                                print (required_part)
Something like this?
sFILE = 'Main keyword:Mainvariable, variable (1):name_1, variable (2, 3):name_2, variable(3, 1, 2):name_3'
# I'm removing the beginning, but you could extract it if you want to check it
# ie "Main keyword:Mainvariable,"
sLINE = sFILE[26:]
# split the rest of the line at "variable"
sVARS = sLINE.split('variable')
for item in sVARS:
    # clean up whitespaces and commas
    item = item.strip(' ,')
    if not item:
        continue  # skip the empty piece before the first "variable"
    print(f'"{item}"')
# prints "(1):name_1"
# prints "(2, 3):name_2"
# prints "(3, 1, 2):name_3"
Now you can extract the name_n, do some other necessary operation, or do the same thing in your other file and cross-check. For example, populate the names in a list (or a dict, depending on your need), then extract the "1", "2" and "3" from the other file and look them up in that list or dictionary.
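The dictionary lookup suggested above could be sketched as follows. It assumes that every entry looks like variable (indices):name and that the indices are integers; the regular expressions and the names build_lookup and resolve are illustrative, not taken from the question.

```python
import re

ENTRY_RE = re.compile(r"variable\s*\(([^)]*)\)\s*:\s*([^,\s]+)")
QUERY_RE = re.compile(r"variable\s*\(([^)]*)\)")


def normalize(indices):
    """Turn "3, 1, 2" or "3,1,2" into the tuple (3, 1, 2)."""
    return tuple(int(part) for part in indices.split(","))


def build_lookup(library_line):
    """Map each index tuple in a library line to its name."""
    return {normalize(idx): name for idx, name in ENTRY_RE.findall(library_line)}


def resolve(queries, lookup):
    """Return the names for the queries that are present in the lookup."""
    names = []
    for query in queries:
        match = QUERY_RE.search(query)
        if match and normalize(match.group(1)) in lookup:
            names.append(lookup[normalize(match.group(1))])
    return names


library = ("Main keyword:Mainvariable, variable (1):name_1, "
           "variable (2, 3):name_2, variable(3, 1, 2):name_3")
lookup = build_lookup(library)
print(resolve(["variable (1)", "variable (3,1,2)", "variable (1)"], lookup))
# ['name_1', 'name_3', 'name_1']
```

Normalizing the indices to a tuple means that variable (3,1,2) in the input file still matches variable(3, 1, 2) in the library line, regardless of spacing.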
