How to parse a multi-line Catalina log in Python

I have a Catalina log:
oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated
WARNING: HTTP Session created without LoggedInSessionBean
oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
at ais.api.rest.rdss.Resource.lookAT(Resource.java:22)
at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
I'm trying to parse it in Python. My problem is that I don't know how many lines each log entry spans; the minimum is 2 lines. I read from the file, and when a line starts with j, m, s, o, etc., I know it's the first line of an entry, because those are the first letters of the month names. But I don't know how to continue. When do I stop reading lines? When the next line starts with one of those letters? But how do I do that?
import datetime
import re
import MySQLdb  # needed for the connect() call below

SPACE = r'\s'
TIME = r'(?P<time>.*?M)'
PATH = r'(?P<path>.*?\S)'
METHOD = r'(?P<method>.*?\S)'
REQUEST = r'(?P<request>.*)'
TYPE = r'(?P<type>.*?\:)'
REGEX = TIME + SPACE + PATH + SPACE + METHOD + SPACE + TYPE + SPACE + REQUEST

def parser(log_line):
    match = re.search(REGEX, log_line)
    return (match.group('time'),
            match.group('path'),
            match.group('method'),
            match.group('type'),
            match.group('request'))

db = MySQLdb.connect(host="localhost", user="myuser", passwd="mypsswd", db="Database")
with db:
    cursor = db.cursor()
    with open("Mylog.log", "r") as f:  # "rw" is not a valid mode; read-only suffices
        for line in f:
            # Month names start with one of these letters
            if line.startswith(('j', 'f', 'm', 'a', 's', 'o', 'n', 'd')):
                result = parser(line)
                sql = ("INSERT INTO ..... ")
                data = (result[0])
                cursor.execute(sql, data)
# the with blocks close the file and the connection; no explicit close is needed
The best idea I have is to read just two lines at a time, but that would discard all the other data. There must be a better way.
I want to read each entry as a single line, like this:
1.line - oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean
2.line - oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException SEVERE: Mapped exception to response: 500 (Internal Server Error) javax.ws.rs.WebApplicationException at ais.api.rest.rdss.Resource.lookAT(Resource.java:22) at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
3.line - oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean
So I want to start reading when a line starts with a datetime (that part is no problem). The problem is that I want to stop reading when the next line starts with a datetime.

This may be what you want.
I read lines from the log inside a generator so that I can classify each one as a datetime line or a continuation line. Importantly, the generator also flags when end-of-file has been reached in the log file.
In the main loop of the program I start accumulating lines in a list when I get a datetime line. Each time I see a new datetime line I print the accumulated entry, if it's non-empty, and start a new one. Since the program will have accumulated a complete entry when end-of-file occurs, I arrange to print the accumulated entry at that point too.
import re

a_date, other, EOF = 0, 1, 2

def One_line():
    with open('caroline.txt') as caroline:
        for line in caroline:
            line = line.strip()
            m = re.match(r'[a-z]{3}\s+[0-9]{1,2},\s+[0-9]{4}\s+[0-9]{1,2}:[0-9]{2}:[0-9]{2}\s+[AP]M', line, re.I)
            if m:
                yield a_date, line
            else:
                yield other, line
    yield EOF, ''  # signal end-of-file

complete_line = []
for kind, content in One_line():
    if kind in (a_date, EOF):
        if complete_line:
            print(' '.join(complete_line))
        complete_line = [content]
    else:
        complete_line.append(content)
Output:
oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean
oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException SEVERE: Mapped exception to response: 500 (Internal Server Error) javax.ws.rs.WebApplicationException at ais.api.rest.rdss.Resource.lookAT(Resource.java:22) at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
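An alternative approach, if the whole log fits in memory, is to split the text wherever a new datetime line begins, using a zero-width lookahead. This is only a sketch; note that re.split splits on zero-width matches only from Python 3.7 onward.
import re

# Lookahead that matches at the start of each 'oct 21, 2016 12:32:13 AM ...' line
DATE_RE = r'(?=^[a-z]{3}\s+\d{1,2},\s+\d{4}\s+\d{1,2}:\d{2}:\d{2}\s+[AP]M)'

with open('caroline.txt') as f:
    text = f.read()

# Each chunk is one complete entry; collapse its internal newlines to spaces
entries = [' '.join(chunk.split())
           for chunk in re.split(DATE_RE, text, flags=re.I | re.M)
           if chunk.strip()]

for entry in entries:
    print(entry)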

Related

How to read lines with different formats (exceptions) in a Lambda function (Python) that reads from S3 into ELK?

Example Format of my Log file
-------------------------------------------------------------Start of Log file Format----------------------------------------
2020-08-14 05:35:48.752 - [INFO] - from [Class:com.webservices.services.impl.DataImpl Method:bData] in 20 - Data Single completed in 1047 mili sec
2020-08-14 05:35:48.752 - [INFO] - from [Class:com.webservices.services.impl.DataImpl Method:bData] in 20 - Data Single completed in 1099 mili sec
2020-08-14 05:35:48.762 - [ERROR] - from [Class:com.webservices.helper.THelper Method:lambda$0] in 20 - Parsing Response from server Failed
org.json.JSONException: No value for dataModel
at org.json.JSONObject.get(JSONObject.java:355) ~[android-json-0.0.20131108.vaadin1.jar:0.0.20131108.vaadin1]
-----------------------------------------------------------------------End of Log file format ------------------------------
I currently handle the lines that start with a date (the first 3 lines of the log above), albeit with bad coding practice.
The problem starts from the 4th line onward, where the exceptions in my log file have no date. The previous line has the date, and the exception is a continuation of that entry.
I do not know how to handle those lines as the format changes. I want to use the date of the previous line for the exception lines. Either I keep the previous line's date in a temporary variable and use it when the format changes, or there is some other way.
I need the date as mandatory to push into ELK, which takes the timestamp for the log.
Also, if there is a suggestion for cleaner code, please make it. I need the Lambda to run faster.
Currently this is my logic in Python (I am a complete beginner in Python):
import boto3
import botocore
import re
import json
import requests
from datetime import datetime                # needed for datetime.strptime below
from botocore.exceptions import ClientError  # needed for the except clause below
...
...

def handler(event, context):
    print("Starting handler function")
    for record in event['Records']:
        # Get the bucket name and key for the new file
        bucket = record['s3']['bucket']['name']
        print("Bucket name - " + bucket)
        print(date_time)  # date_time is presumably defined in the elided code above
        key = record['s3']['object']['key']
        print("key name - " + key)
        # Get, read, and split the file into lines
        obj = s3.get_object(Bucket=bucket, Key=key)
        body = obj['Body'].read().decode('utf-8')  # read() returns bytes; decode before comparing characters
        lines = body.splitlines()
        # Match the regular expressions to each line and index the JSON
        for line in lines:
            document = {}
            try:
                if line[0] == '2':
                    listOfData = line.split(" - ")
                    date = datetime.strptime(listOfData[0], "%Y-%m-%d %H:%M:%S.%f")
                    timestamp = date.isoformat()
                    level = listOfData[1]
                    classname = listOfData[2]
                    message = listOfData[-1]
                    document = {"timestamps": timestamp, "level": level, "classnames": classname, "messages": message}
                    print(document)
                else:
                    document = {"messages": line}
            except ClientError as e:
                raise e
            r = requests.post(url, auth=awsauth, json=document, headers=headers)
            print(r)
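On the "run faster" point, one HTTP request per line is usually the dominant cost. Below is a sketch of batching with Elasticsearch's _bulk endpoint (newline-delimited JSON); url and awsauth are the question's own variables, and the index name logs is a placeholder.
import json
import requests

def post_bulk(documents):
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": "logs"}}))  # action line
        lines.append(json.dumps(doc))                            # source line
    body = "\n".join(lines) + "\n"  # _bulk requires a trailing newline
    return requests.post(url + "/_bulk", auth=awsauth, data=body,
                         headers={"Content-Type": "application/x-ndjson"})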
Additional info:
As suggested in the answer by Ben, when I print this_line I get all lines in the log file printed properly as separate entries, but there is a problem with the exception lines in between. Below is what is printed.
{'line content': ['2020-08-14 05:35:48.762 - [ERROR] - from [Class:com.webevics.helpr.Helpr Method:lamda] in 2 - Parsng Respns from servr Failed', 'org.jsn.JSONExcepion: No vlue for dataModel'], 'date string': '2020-08-14 05:35:47.655'}
{'line content': ['2020-08-14 05:35:48.762 - [ERROR] - from [Class:com.webserics.helpr.Helpr Method:lambda] in 2 - Parsing Respnse from servr Faied', 'org.jsn.JSONException: No vlue for dataModel', 'at org.json.JSONObject.get(JSONObject.java:355) ~[android-json-0.0.20131108.vaadin1.jar:0.0.20131108.vaadin1]'], 'date string': '2020-08-14 05:35:47.655'}
Here two entries are printed, where the first is useless and the second is better. Can we make it so that the first entry does not appear and only the second is present in this_line?
I started by placing your sample data in a file,
2020-08-14 05:35:48.752 - [INFO] - from [Class:com.webservices.services.impl.DataImpl Method:bData] in 20 - Data Single completed in 1047 mili sec
2020-08-14 05:35:48.752 - [INFO] - from [Class:com.webservices.services.impl.DataImpl Method:bData] in 20 - Data Single completed in 1099 mili sec
2020-08-14 05:35:48.762 - [ERROR] - from [Class:com.webservices.helper.THelper Method:lambda$0] in 20 - Parsing Response from server Failed
org.json.JSONException: No value for dataModel
at org.json.JSONObject.get(JSONObject.java:355) ~[android-json-0.0.20131108.vaadin1.jar:0.0.20131108.vaadin1]
2020-08-14 05:35:48.752 - [INFO] - from [Class:com.webservices.services.impl.DataImpl Method:bData] in 20 - Data Single completed in 1099 mili sec
I've added a fourth entry just to verify my approach works. The Python below detects the start of an entry using a regular expression tailored to the date format:
import re

with open('example.txt', 'r') as file_handle:
    file_content = file_handle.read().split("\n")

list_of_lines = []
this_line = {}
for index, line in enumerate(file_content):
    if len(line.strip()) > 0:
        reslt = re.match(r'\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{3}', line)
        if reslt:  # line starts with a date
            if len(this_line.keys()) > 0:
                list_of_lines.append(this_line)
            date_string = reslt.group(0)
            this_line = {'date string': date_string, 'line content': []}
            this_line['line content'].append(line)
        else:
            this_line['line content'].append(line)
if this_line:
    list_of_lines.append(this_line)  # don't drop the final entry
The data structure produced, list_of_lines, is a list of dictionaries. Each dictionary is one log entry, which may contain one or more lines. The two keys in each dictionary are date string and line content.
To review the output data structure, try running
for line in list_of_lines[2]['line content']:
    print(line)
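To tie this back to the Lambda: each dictionary can then be pushed to ELK as one document, so continuation lines inherit their entry's date. A sketch reusing the question's url, awsauth, and headers; the field names are placeholders.
import requests

for entry in list_of_lines:
    document = {
        "timestamps": entry['date string'],
        "messages": "\n".join(entry['line content']),
    }
    r = requests.post(url, auth=awsauth, json=document, headers=headers)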

Python scripting with ete3 to query NCBI's Taxonomy: "sqlite3 Warning (can only execute one statement at a time)"

I am using this script:
import csv
import time
import sys
from ete3 import NCBITaxa

ncbi = NCBITaxa()

def get_desired_ranks(taxid, desired_ranks):
    lineage = ncbi.get_lineage(taxid)
    names = ncbi.get_taxid_translator(lineage)
    lineage2ranks = ncbi.get_rank(names)
    ranks2lineage = dict((rank, taxid) for (taxid, rank) in lineage2ranks.items())
    return {'{}_id'.format(rank): ranks2lineage.get(rank, '<not present>') for rank in desired_ranks}

if __name__ == '__main__':
    file = open(sys.argv[1], "r")
    taxids = []
    contigs = []
    for line in file:
        line = line.split("\n")[0]
        taxids.append(line.split(",")[0])
        contigs.append(line.split(",")[1])
    desired_ranks = ['superkingdom', 'phylum']
    results = list()
    for taxid in taxids:
        results.append(list())
        results[-1].append(str(taxid))
        ranks = get_desired_ranks(taxid, desired_ranks)
        for key, rank in ranks.items():
            if rank != '<not present>':
                results[-1].append(list(ncbi.get_taxid_translator([rank]).values())[0])
            else:
                results[-1].append(rank)
    i = 0
    for result in results:
        print(contigs[i] + ','),
        print(','.join(result))
        i += 1
    file.close()
The script takes taxids from a file and fetches their respective lineages from a local copy of NCBI's Taxonomy database. Strangely, the script works fine when I run it on small sets of taxids (~70, ~100), but most of my datasets contain upwards of 280k taxids, and those break the script.
I get this complete error:
Traceback (most recent call last):
  File "/data1/lstout/blast/scripts/getLineageByETE3.py", line 31, in <module>
    ranks = get_desired_ranks(taxid, desired_ranks)
  File "/data1/lstout/blast/scripts/getLineageByETE3.py", line 11, in get_desired_ranks
    lineage = ncbi.get_lineage(taxid)
  File "/data1/lstout/.local/lib/python2.7/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 227, in get_lineage
    result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %taxid)
sqlite3.Warning: You can only execute one statement at a time.
The first two frames in the traceback are simply the script I referenced above; the third file is one of ete3's. And as I stated, the script works fine with small datasets.
What I have tried:
Importing the time module and sleeping for a few milliseconds or hundredths of a second before and after the offending lines of code (lines 11 and 31). No effect.
Went to line 227 in ete3's code...
result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %merged_conversion[taxid])
and changed the "execute" function to "executescript" in order to be able to handle multiple queries at once (as that seems to be the problem). This produced a new error and led to a rabbit hole of me changing minor things in their script trying to fudge this to work. No result. This is the complete offending function:
def get_lineage(self, taxid):
    """Given a valid taxid number, return its corresponding lineage track as a
    hierarchically sorted list of parent taxids.
    """
    if not taxid:
        return None
    result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %taxid)
    raw_track = result.fetchone()
    if not raw_track:
        # perhaps is an obsolete taxid
        _, merged_conversion = self._translate_merged([taxid])
        if taxid in merged_conversion:
            result = self.db.execute('SELECT track FROM species WHERE taxid=%s' %merged_conversion[taxid])
            raw_track = result.fetchone()
        # if not raise error
        if not raw_track:
            #raw_track = ["1"]
            raise ValueError("%s taxid not found" %taxid)
        else:
            warnings.warn("taxid %s was translated into %s" %(taxid, merged_conversion[taxid]))
    track = list(map(int, raw_track[0].split(",")))
    return list(reversed(track))
What bothers me so much is that this works on small amounts of data! I'm running these scripts on my school's high-performance computer and have tried both the head node and an interactive Moab scheduler session. Nothing has helped.
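One possibility worth checking, offered as a guess rather than a confirmed diagnosis: get_lineage interpolates the taxid straight into the SQL string, so any stray character in the input file (a '\r' from Windows line endings, an embedded newline, or a semicolon somewhere in 280k rows) would turn the query into more than one statement, which is exactly what the sqlite3.Warning complains about. A cheap sanity check is to sanitize the IDs as they are read:
def clean_taxid(raw):
    # Strip surrounding whitespace (including '\r' from Windows line endings)
    # and insist the ID is purely numeric before it reaches the SQL string.
    taxid = raw.strip()
    if not taxid.isdigit():
        raise ValueError("malformed taxid: %r" % raw)
    return taxid

# in the reading loop:
# taxids.append(clean_taxid(line.split(",")[0]))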

Python question regarding re.search and indexing

I'm new to Python and coding, and I'm trying to understand how re.search and indexing work.
Below is what I have so far. I want physical_state to equal what comes after Completion: (in this case success), but I don't really understand how match.group(1) and re.search work.
import ops  # Import the OPS module.
import sys  # Import the sys module.
import re

# Subscription processing function
def ops_condition(o):
    status, err_str = o.timer.relative("tag", 10)
    return status

def ops_execute(o):
    handle, err_desp = o.cli.open()
    print("OPS opens the process of command:", err_desp)
    result, n11, n21 = o.cli.execute(handle, "return")
    result, n11, n21 = o.cli.execute(handle, "dis nqa results test-instance sla 1 | i Completion:")
    match = re.search(r"Completion:", result)
    if not match:
        print("Could not determine the state.")
        return 0  # Look into what the return values mean.
    physical_state = match.group(1)  # Gets the first group from the match.
    print(physical_state)
    result = o.cli.close(handle)
    return 0
Output of result, n11, n21 = o.cli.execute(handle, "dis nqa results test-instance sla 1 | i Completion:"):
Completion:success RTD OverThresholds number: 0
Completion:success RTD OverThresholds number: 0
Completion:success RTD OverThresholds number: 0
Completion:success RTD OverThresholds number: 0
Completion:success RTD OverThresholds number: 0
Error when run:
<setup>('OPS opens the process of command:', 'success')
Oct 18 2018 06:12:57+00:00 setup %%01OPSA/3/OPS_RESULT_EXCEPTION(l)[410]:Script is test3.py, current event is tag, instance is 1381216156, exception reason is Traceback (most recent call last):
File ".lib/frame.py", line 114, in <module>
ret = m.ops_execute(o)
File "flash:$_user/test3.py", line 21, in ops_execute
physical_state = match.group(1) # Gets the first group from the match.
IndexError: no such group
Thanks in advance
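For reference, since the question asks how re.search and group(1) work: re.search scans the string for the first place the pattern matches and returns a match object (or None), and match.group(1) returns whatever the first parenthesised group in the pattern captured. The pattern r"Completion:" contains no parentheses, which is why group(1) raises IndexError: no such group. A minimal sketch that captures the word after the colon, using a sample line from the output above:
import re

result = "Completion:success RTD OverThresholds number: 0"
match = re.search(r"Completion:(\S+)", result)  # (\S+) is group 1
if match:
    physical_state = match.group(1)
    print(physical_state)  # prints: success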

Python TypeError: '_sre.SRE_Match' object has no attribute '__getitem__'

I'm brand new to Python and coding, and I'm trying to get the code below working.
This is test code; if I can get it working, I should be able to build on it.
Like I said, I'm new to this, so sorry if it's a silly mistake.
# coding=utf-8
import ops  # Import the OPS module.
import sys  # Import the sys module.
import re

# Subscription processing function
def ops_condition(o):
    status, err_str = o.timer.relative("tag", 10)
    return status

def ops_execute(o):
    handle, err_desp = o.cli.open()
    print("OPS opens the process of command:", err_desp)
    result, n11, n21 = o.cli.execute(handle, "return")
    result, n11, n21 = o.cli.execute(handle, "display interface brief | include Ethernet0/0/1")
    match = re.search(r"Ethernet0/0/1\s*(\S+)\s*", result)
    if not match:
        print("Could not determine the state.")
        return 0
    physical_state = match[1]  # Gets the first group from the match.
    print(physical_state)
    if physical_state == "down":
        print("down")
        result = o.cli.close(handle)
    else:
        print("up")
    return 0
Error
<setup>('OPS opens the process of command:', 'success')
Oct 17 2018 11:53:39+00:00 setup %%01OPSA/3/OPS_RESULT_EXCEPTION(l)[4]:Script is test.py, current event is tag, instance is 1515334652, exception reason is Trac eback (most recent call last):
File ".lib/frame.py", line 114, in <module>
ret = m.ops_execute(o)
File "flash:$_user/test.py", line 22, in ops_execute
physical_state = match[1] # Gets the first group from the match.
TypeError: '_sre.SRE_Match' object has no attribute '__getitem__'
The __getitem__ method for regex match objects was only added in Python 3.6. If you're using an earlier version, you can use the group method instead.
Change:
physical_state = match[1]
to:
physical_state = match.group(1)
Please refer to the documentation for details.
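A quick illustration (the interface output line is made up for the example):
import re

output = "Ethernet0/0/1  down  down"  # made-up sample of 'display interface brief'
match = re.search(r"Ethernet0/0/1\s*(\S+)\s*", output)
if match:
    print(match.group(1))  # prints: down (works on Python 2 and 3)
    # match[1] is equivalent, but only on Python 3.6+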

Separate overlapping polygons into regions

I came across the following post on Stack Overflow: Exploding overlapping polygons.
I downloaded the source code posted by the original author and made adjustments trying to get it to work, but I'm currently receiving the following error message and am not sure how to resolve it. Please be advised that I'm still learning to code, so I lack fundamental theory.
Error message:
Executing: OverlapReg E:\Projects\2015\H111225_6\ArcHydro\27Jan15\01SouthNorthAlign\OverlappingWatershedsAnalysis.gdb\Watershed HydroID2
Start Time: Wed Mar 11 14:58:32 2015
Running script OverlapReg...
Failed script OverlapReg...
Traceback (most recent call last):
  File "E:\Python\Masters\Scripts\OverlappingRegions\OverlappingRegions.py", line 59, in <module>
    countOverlaps(fc,idName)
  File "E:\Python\Masters\Scripts\OverlappingRegions\OverlappingRegions.py", line 58, in countOverlaps
    urows.updateRow(urow)
  File "c:\program files (x86)\arcgis\desktop10.2\arcpy\arcpy\arcobjects\arcobjects.py", line 102, in updateRow
    return convertArcObjectToPythonObject(self._arc_object.UpdateRow(*gp_fixargs(args)))
RuntimeError: ERROR 999999: Error executing function. The row contains a bad value. [Watershed] The row contains a bad value. [overlaps]
Failed to execute (OverlapReg). Failed at Wed Mar 11 14:58:35 2015 (Elapsed Time: 2.45 seconds)
I'm trying to assign IDs to my Watershed feature class, using the following code, so that I can split it into the smallest number of separate feature classes in which the watersheds do not overlap one another. I need to export them to an AutoCAD drawing, where there must be no overlapping features within a single layer.
import os
import arcpy
from arcpy import GetParameterAsText

fc = GetParameterAsText(0)
idName = GetParameterAsText(1)
dirname = os.path.dirname(arcpy.Describe(fc).catalogPath)
desc = arcpy.Describe(dirname)
if hasattr(desc, "datasetType") and desc.datasetType == 'FeatureDataset':
    dirname = os.path.dirname(dirname)
arcpy.env.workspace = dirname

def countOverlaps(fc, idName):
    intersect = arcpy.Intersect_analysis(fc, 'intersect')
    findID = arcpy.FindIdentical_management(intersect, "explFindID", "Shape")
    arcpy.MakeFeatureLayer_management(intersect, "intlyr")
    arcpy.AddJoin_management("intlyr", arcpy.Describe("intlyr").OIDfieldName, findID, "IN_FID", "KEEP_ALL")
    segIDs = {}
    featseqName = "explFindID.FEAT_SEQ"
    idNewName = "intersect." + idName
    for row in arcpy.SearchCursor("intlyr"):
        idVal = row.getValue(idNewName)
        featseqVal = row.getValue(featseqName)
        segIDs[featseqVal] = []
    for row in arcpy.SearchCursor("intlyr"):
        idVal = row.getValue(idNewName)
        featseqVal = row.getValue(featseqName)
        segIDs[featseqVal].append(idVal)
    segIDs2 = {}
    for row in arcpy.SearchCursor("intlyr"):
        idVal = row.getValue(idNewName)
        segIDs2[idVal] = []
    for x, y in segIDs.iteritems():
        for segID in y:
            segIDs2[segID].extend([k for k in y if k != segID])
    for x, y in segIDs2.iteritems():
        segIDs2[x] = list(set(y))
    arcpy.RemoveJoin_management("intlyr", arcpy.Describe(findID).name)
    if 'overlaps' not in [k.name for k in arcpy.ListFields(fc)]:
        arcpy.AddField_management(fc, 'overlaps', "TEXT")
    if 'ovlpCount' not in [k.name for k in arcpy.ListFields(fc)]:
        arcpy.AddField_management(fc, 'ovlpCount', "SHORT")
    urows = arcpy.UpdateCursor(fc)
    for urow in urows:
        idVal = urow.getValue(idName)
        if segIDs2.get(idVal):
            urow.overlaps = str(segIDs2[idVal]).strip('[]')
            urow.ovlpCount = len(segIDs2[idVal])
        urows.updateRow(urow)

countOverlaps(fc, idName)

def explodeOverlaps(fc, idName):
    countOverlaps(fc, idName)
    arcpy.AddField_management(fc, 'expl', "SHORT")
    urows = arcpy.UpdateCursor(fc, '"overlaps" IS NULL')
    for urow in urows:
        urow.expl = 1
        urows.updateRow(urow)
    i = 1
    lyr = arcpy.MakeFeatureLayer_management(fc)
    while int(arcpy.GetCount_management(arcpy.SelectLayerByAttribute_management(lyr, "NEW_SELECTION", '"expl" IS NULL')).getOutput(0)) > 0:
        ovList = []
        urows = arcpy.UpdateCursor(fc, '"expl" IS NULL', '', '', 'ovlpCount D')
        for urow in urows:
            ovVal = urow.overlaps
            idVal = urow.getValue(idName)
            intList = ovVal.replace(' ', '').split(',')
            for x in intList:
                intList[intList.index(x)] = int(x)
            if idVal not in ovList:
                urow.expl = i
                urows.updateRow(urow)
                ovList.extend(intList)
        i += 1

explodeOverlaps(fc, idName)
Any assistance in resolving this will be truly appreciated.
The clues are in the errors.
the row contains a bad value [Watershed]
the row contains a bad value [overlaps]
This is likely caused by trying to insert a value into the overlaps field that violates one of the field's properties; for example, if the field length is 4 and your value is "long string", the value is too big to be inserted.
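If that is the cause here, one possible fix (a sketch; 1000 is an arbitrary size, and field_length is a standard parameter of AddField_management) is to give the overlaps field an explicit, generous length when it is created:
if 'overlaps' not in [k.name for k in arcpy.ListFields(fc)]:
    arcpy.AddField_management(fc, 'overlaps', "TEXT", field_length=1000)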