Merging and sorting multiline logs in Python

I have a bunch of log files with the following format:
[Timestamp1] Text1
Text2
Text3
[Timestamp2] Text4
Text5
...
...
Where the number of text lines following a timestamp can vary from 0 to many. All the lines following a timestamp until the next timestamp are part of the previous log statement.
Example:
[2016-03-05T23:18:23.672Z] Some log text
[2016-03-05T23:18:23.672Z] Some other log text
[2016-03-05T23:18:23.672Z] Yet another log text
Some text
Some text
Some text
Some text
[2016-03-05T23:18:23.672Z] Log text
Log text
I am trying to create a log merge script for such types of log files and have been unsuccessful so far.
If the logs were in a standard format where each line is a separate log entry, it would be straightforward to create a log merge script using fileinput and sorting.
I think I am looking for a way to treat multiple lines as a single log entity that is sortable on the associated timestamp.
Any pointers?

You can write a generator that acts as an adapter for your log stream to do the chunking for you. Something like this:
def log_chunker(log_lines):
    batch = []
    for line in log_lines:
        if batch and has_timestamp(line):
            # detected a new log statement, so yield the previous one
            yield batch
            batch = []
        batch.append(line)
    yield batch
This will turn your raw log lines into batches where each one is a list of lines, and the first line in each list has the timestamp. You can build the rest from there. It might make more sense to start batch as an empty string and tack on the rest of the message directly; whatever works for you.
Side note: if you're merging multiple timestamped logs, you shouldn't need to perform a global sort at all if you use a streaming merge sort.
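As a rough illustration of that idea, here is a minimal, untested sketch that feeds log_chunker into heapq.merge; the has_timestamp helper, its regex, and the file names are assumptions rather than part of the question:
import heapq
import re

# Assumed helper: a line starts a new entry if it begins with a bracketed ISO timestamp.
TIMESTAMP_RE = re.compile(r'\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}')

def has_timestamp(line):
    return TIMESTAMP_RE.match(line) is not None

# Each input file is already in timestamp order, so merging the chunked streams
# keyed on their first (timestamp) line yields globally sorted output without
# loading everything into memory (Python 3.5+ for the key argument).
files = [open(name) for name in ('a.log', 'b.log')]   # example file names
merged = heapq.merge(*(log_chunker(f) for f in files), key=lambda batch: batch[0])
with open('merged.log', 'w') as out:
    for batch in merged:
        out.writelines(batch)
for f in files:
    f.close()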

The following approach should work well.
from heapq import merge
from itertools import groupby
import re
import glob
re_timestamp = re.compile(r'\[\d{4}-\d{2}-\d{2}')
def get_log_entry(f):
    entry = ''
    for timestamp, g in groupby(f, lambda x: re_timestamp.match(x) is not None):
        entries = [row.strip() + '\n' for row in g]
        if timestamp:
            # Timestamped lines: every line but the last is a complete single-line
            # entry; hold the last one back in case continuation text follows.
            if len(entries) > 1:
                for single in entries[:-1]:
                    yield single
            entry = entries[-1]
        else:
            # Continuation lines: attach them to the pending timestamped line.
            yield entry + ''.join(entries)
            entry = ''
    if entry:
        # Flush a final entry that had no continuation lines after it.
        yield entry

files = [open(f) for f in glob.glob('*.log')]  # Open all log files

with open('output.txt', 'w') as f_output:
    for entry in merge(*[get_log_entry(f) for f in files]):
        f_output.write(entry)

for f in files:
    f.close()
It makes use of the merge function to combine a list of iterables in order.
As your timestamps are naturally ordered, all that is needed is a function that reads whole entries at a time from each file. A regular expression spots the lines in each file that start with a timestamp, and groupby is used to read the matching rows in at once.
glob is used to first find all files in your folder with a .log extension.

You can easily break it into chunks using re.split() with a capturing regexp:
pieces = re.split(r"(^\[20\d\d-.*?\])", logtext, flags=re.M)
You can make the regexp as precise as you wish; I just require [20\d\d- at the start of a line. The result contains the matching and non-matching parts of logtext, as alternating pieces (starting with an empty non-matching part).
>>> print(pieces[:5])
['', '[2016-03-05T23:18:23.672Z] ', 'Some log text\n', '[2016-03-05T23:18:23.672Z] ', 'Some other log text\n']
It remains to reassemble the log parts, which you can do with this recipe from itertools:
import itertools

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

# pieces[1:] alternates timestamp, message, timestamp, message, ...; take every
# other pair from pairwise() so each entry is exactly one timestamp plus its message.
log_entries = ["".join(pair) for pair in itertools.islice(pairwise(pieces[1:]), 0, None, 2)]
If you have several of these lists, you can indeed just combine and sort them, or use a fancier merge sort if you have lots of data. I understand your question to be about splitting up the log entries, so I won't go into this.
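If you do get to that point, here is a tiny untested sketch of the "combine and sort" step, where log_entries_a and log_entries_b are hypothetical lists built as above (one per input file); it relies on the bracketed ISO-8601 timestamps sorting correctly as plain strings:
import itertools

# Lexicographic order of the '[YYYY-MM-DDTHH:MM:SS...' prefixes matches
# chronological order, so sorting the entry strings sorts them by timestamp.
merged_entries = sorted(itertools.chain(log_entries_a, log_entries_b))
print("".join(merged_entries))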

Related

Extract time values from a list and add to a new list or array

I have a script that reads through a log file that contains hundreds of these logs and looks for the ones that have an "On", "Off", or "Switch" type. Then I output each log into its own list. I'm trying to find a way to extract the Out and In times into a separate list/array and then subtract the two times to find the duration of each separate log. This is what the outputted logs look like:
['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
This is my current code:
logfile = '/path/to/my/logfile'

with open(logfile, 'r') as f:
    text = f.read()

words = ["On", "Off", "Switch"]

text2 = text.split('\n')

for l in text.split('\n'):
    if (words[0] in l or words[1] in l or words[2] in l):
        log = l.split(',')[0:3]
I'm stuck on how to target only the Out and In time values from the logs and put them in an array and convert to a time value to find duration.
Initial log before script: everything after the "In" time is useless for what I'm looking for so I only have the first three indices outputted
2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a","Type":"Switch,"In":"2020-01-31T00:30:20.140Z","Path":"interface","message":"interface changed status from unknown to normal","severity":"INFORMATIONAL","display":true,"json_map":"{\"severity\":null,\"eventId\":\"65e-64d9-45-ab62-8ef98ac5e60d\",\"componentPath\":\"interface_css\",\"displayToGui\":false,\"originalState\":\"unknown\",\"closed\":false,\"eventType\":\"InterfaceStateChange\",\"time\":\"2019-04-18T07:04:32.747Z\",\"json_map\":null,\"message\":\"interface_css changed status from unknown to normal\",\"newState\":\"normal\",\"info\":\"Event created with current status\"}","closed":false,"info":"Event created with current status","originalState":"unknown","newState":"normal"}
Below is a possible solution. The wordmatch line is a bit of a hack until I find something clearer: it's just a one-liner that creates an empty or one-element set of True if one of the words matches.
(Untested)
import re

logfile = '/path/to/my/logfile'
words = ["On", "Off", "Switch"]

dateformat = r'\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[Zz]?'
pattern = fr'Out:\s*\[(?P<out>{dateformat})\].*In":\s*\"(?P<in>{dateformat})\"'
regex = re.compile(pattern)

with open(logfile, 'r') as f:
    for line in f:
        wordmatch = set(filter(None, (word in line for word in words)))
        if wordmatch:
            match = regex.search(line)
            if match:
                intime = match.group('in')
                outtime = match.group('out')
                # whatever to store these strings, e.g., append to list or insert in a dict.
As noted, your log example is very awkward, so this works for the example line, but may not work for every line. Adjust as necessary.
I have also not included (if so wanted), a conversion to a datetime.datetime object. For that, read through the datetime module documentation, in particular datetime.strptime. (Alternatively, you may want to store your results in a Pandas table. In that case, read through the Pandas documentation on how to convert strings to actual datetime objects.)
You also don't need to read and split on newlines yourself: for line in f will do that for you (provided f is indeed a filehandle).
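If you also want the duration, here is a minimal, untested sketch using datetime.strptime on the intime/outtime strings captured above (it assumes the trailing 'Z' is present and Python 3.7+, where %z accepts 'Z'):
from datetime import datetime

fmt = '%Y-%m-%dT%H:%M:%S.%f%z'
out_dt = datetime.strptime(outtime, fmt)
in_dt = datetime.strptime(intime, fmt)
duration = out_dt - in_dt          # a timedelta
print(duration.total_seconds())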
Regex is probably the way to go (speed, efficiency, etc.) ... but ...
You could take a very simplistic (if very inefficient) approach of cleaning your data:
join all of it into a string
replace things that hinder easy parsing
split wisely and filter the split
like so:
data = ['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']

all_text = " ".join(data)

# this is inefficient and will create throwaway intermediate strings - if you are
# in a hurry or operate on 100s of MB of data, this is NOT the way to go, unless
# you have time

# iterate pairs of ("bad thing", "what to replace it with") (or list of bad things)
for thing in [(": ", ":"), (list('[]{}"'), "")]:
    whatt = thing[0]
    withh = thing[1]
    # if list, do so for each bad thing
    if isinstance(whatt, list):
        for p in whatt:
            # replace it
            all_text = all_text.replace(p, withh)
    else:
        all_text = all_text.replace(whatt, withh)

# format is now far better suited to splitting/filtering
cleaned = [a for a in all_text.split(" ")
           if any(a.startswith(prefix) or "Switch" in a
                  for prefix in {"In:", "Switch:", "Out:"})]

print(cleaned)
Outputs:
['Out:2020-01-31T00:30:20.150Z', 'Type:Switch', 'In:2020-01-31T00:30:20.140Z']
After cleaning your data would look like:
2020-01-31T12:04:57.976Z 1234 Out:2020-01-31T00:30:20.150Z Id:Id:4-f-4-9-6a Type:Switch In:2020-01-31T00:30:20.140Z
You can transform the clean list into a dictionary for ease of lookup:
d = dict( part.split(":",1) for part in cleaned)
print(d)
will produce:
{'In': '2020-01-31T00:30:20.140Z',
'Type': 'Switch',
'Out': '2020-01-31T00:30:20.150Z'}
You can use the datetime module to parse the times from your values, as shown in the answer above.

What is the most effective way to compare strings using python in two very large files?

I have two large text files with about 10k lines in each. Each line has a unique string in the same position that needs to be compared with all the other strings in the other file to see if it matches, and if not, print it out. I'm not sure how to do this in a way that makes sense time-wise since the files are so large. Here's an example of the files.
File 1:
https://www.exploit-db.com/exploits/10185/
https://www.exploit-db.com/exploits/10189/
https://www.exploit-db.com/exploits/10220/
https://www.exploit-db.com/exploits/10217/
https://www.exploit-db.com/exploits/10218/
https://www.exploit-db.com/exploits/10219/
https://www.exploit-db.com/exploits/10216/
file 2:
EXPLOIT:10201 CVE-2009-4781
EXPLOIT:10216 CVE-2009-4223
EXPLOIT:10217 CVE-2009-4779
EXPLOIT:10218 CVE-2009-4082
EXPLOIT:10220 CVE-2009-4220
EXPLOIT:10226 CVE-2009-4097
I want to check if the numbers at the end of the first file match any of the numbers after EXPLOIT:
As others have said, 10k lines aren't a problem for computers that have gigabytes of memory. The important steps are:
figure out how to get the identifier out of lines in the first file
and again, but for the second file
put them together to loop over lines in each file and produce your output
Regular expressions are made for working with text like this. I get regexes that look like /([0-9]+)/$ and :([0-9]+) for the two files (services like https://regex101.com/ are great for experimenting).
you can put these together in Python by doing:
from sys import stderr
import re

# collect all exploits for easy matching
exploits = {}
for line in open('file_2'):
    m = re.search(r':([0-9]+) ', line)
    if not m:
        print("couldn't find an id in:", repr(line), file=stderr)
        continue
    [id] = m.groups()
    exploits[id] = line

# match them up
for line in open('file_1'):
    m = re.search(r'/([0-9]+)/$', line)
    if not m:
        print("couldn't find an id in:", repr(line), file=stderr)
        continue
    [id] = m.groups()
    if id in exploits:
        pass  # print(line, 'matched with', exploits[id])
    else:
        print(line)

How to load a dataframe from a file containing unwanted characters?

I'm in need of some knowledge on how to fix an error I have made while collecting data. The collected data has the following structure:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
I normally wouldn't have added "[" or "]" to the .txt file when writing the data to it, line by line. However, the mistake was made, and as a result the brackets get read in as part of the data when the file is loaded.
Is there a way to load the data properly to pandas?
On the snippet that I can cut and paste from the question (which I named test.txt), I could successfully read a dataframe via
Purging square brackets (with sed on a Linux command line, but this can be done e.g. with a text editor, or in python if need be)
sed -i 's/^\[//g' test.txt # remove left square brackets assuming they are at the beginning of the line
sed -i 's/\]$//g' test.txt # remove right square brackets assuming they are at the end of the line
Loading the dataframe (in a python console)
import pandas as pd
pd.read_csv("test.txt", skipinitialspace = True, quotechar='"')
(not sure that this will work for the entirety of your file though).
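If you would rather do the bracket-stripping in Python than with sed (the file names here are just examples), a minimal equivalent could look like this:
# Strip a leading '[' and a trailing ']' from each line, mirroring the two sed
# commands; writes to a new file instead of editing in place.
with open('test.txt') as src, open('test_clean.txt', 'w') as dst:
    for line in src:
        line = line.rstrip('\n')
        if line.startswith('['):
            line = line[1:]
        if line.endswith(']'):
            line = line[:-1]
        dst.write(line + '\n')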
Consider the code below, which reads the text in myfile.txt, which looks like this:
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words ,it's basically creating a mini tornado."]
The code below removes [ and ] from the text and then splits every line on commas, excluding the first line, which holds the headers. Some Message values contain commas, which would otherwise spill into extra columns (NaN elsewhere), so the code joins everything after the first comma back into a single string, as intended.
Code:
import pandas as pd

with open('myfile.txt', 'r') as my_file:
    text = my_file.read()
    text = text.replace("[", "")
    text = text.replace("]", "")

df = pd.DataFrame({
    'Author': [i.split(',')[0] for i in text.split('\n')[1:]],
    'Message': [''.join(i.split(',')[1:]) for i in text.split('\n')[1:]]
}).applymap(lambda x: x.replace('"', ''))
Output:
Author Message
0 littleblackcat There's a lot of redditors here that live in the area maybe/hopefully someone saw something.
1 Kruse In other words it's basically creating a mini tornado.
Here are a few more options to add to the mix:
You could parse the lines yourself using ast.literal_eval, and then load them into a pd.DataFrame directly using an iterator over the lines:
import pandas as pd
import ast

with open('data', 'r') as f:
    lines = (ast.literal_eval(line) for line in f)
    header = next(lines)
    df = pd.DataFrame(lines, columns=header)

print(df)
Note, however, that calling ast.literal_eval once for each line may not be very fast, especially if your data file has a lot of lines. However, if the data file is not too big, this may be an acceptable, simple solution.
Another option is to wrap an arbitrary iterator (which yields bytes) in an IterStream. This very general tool (thanks to Mechanical snail) allows you to manipulate the contents of any file and then re-package it into a file-like object. Thus, you can fix the contents of the file, and yet still pass it to any function which expects a file-like object, such as pd.read_csv. (Note: I've answered a similar question using the same tool, here.)
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.
    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).
    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def clean(f):
    for line in f:
        yield line.strip()[1:-1] + b'\n'

with open('data', 'rb') as f:
    # https://stackoverflow.com/a/50334183/190597 (Davide Fiocco)
    df = pd.read_csv(iterstream(clean(f)), skipinitialspace=True, quotechar='"')

print(df)
A pure pandas option is to change the separator from , to ", " in order to have only 2 columns, and then, strip the unwanted characters, which to my understanding are [,], " and space:
import pandas as pd
import io
string = '''
["Author", "Message"]
["littleblackcat", " There's a lot of redditors here that live in the area maybe/hopefully someone saw something. "]
["Kruse", "In other words, it's basically creating a mini tornado."]
'''
df = pd.read_csv(io.StringIO(string),sep='\", \"', engine='python').apply(lambda x: x.str.strip('[\"] '))
# the \" instead of simply " is to make sure python does not interpret is as an end of string character
df.columns = [df.columns[0][2:],df.columns[1][:-2]]
print(df)
# Output (note that the space before "There's" is also gone):
# Author Message
# 0 littleblackcat There's a lot of redditors here that live in t...
# 1 Kruse In other words, it's basically creating a mini...
For now the following solution was found:
sep = '[|"|]'
Using a multi-character separator allowed the brackets to be stored in separate columns of the pandas dataframe, which were then dropped. This avoids having to strip the words line by line.

Trying to read text file and count words within defined groups

I'm a novice Python user. I'm trying to create a program that reads a text file and searches that text for certain words that are grouped (that I predefine by reading from csv). For example, if I wanted to create my own definition for "positive" containing the words "excited", "happy", and "optimistic", the csv would contain those terms. I know the below is messy - the txt file I am reading from contains 7 occurrences of the three "positive" tester words I read from the csv, yet the results print out to be 25. I think it's returning character count, not word count. Code:
import csv
import string
import re
from collections import Counter

remove = dict.fromkeys(map(ord, '\n' + string.punctuation))

# Read the .txt file to analyze.
with open("test.txt", "r") as f:
    textanalysis = f.read()
    textresult = textanalysis.lower().translate(remove).split()

# Read the CSV list of terms.
with open("positivetest.csv", "r") as senti_file:
    reader = csv.reader(senti_file)
    positivelist = list(reader)

# Convert term list into flat chain.
from itertools import chain
newposlist = list(chain.from_iterable(positivelist))

# Convert chain list into string.
posstring = ' '.join(str(e) for e in newposlist)
posstring2 = posstring.split(' ')
posstring3 = ', '.join('"{}"'.format(word) for word in posstring2)

# Count number of words as defined in list category
def positive(str):
    counts = dict()
    for word in posstring3:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    total = sum(counts.values())
    return total

# Print result; will write to CSV eventually
print("Positive: ", positive(textresult))
I'm a beginner as well but I stumbled upon a process that might help. After you read in the file, split the text at every space, tab, and newline. In your case, I would keep all the words lowercase and include punctuation in your split call. Save this as an array and then parse it with some sort of loop to get the number of instances of each 'positive,' or other, word.
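A minimal sketch of that idea (untested; the file names follow the question, and the exact tokenization is an assumption):
import csv
import string
from collections import Counter
from itertools import chain

# Read and tokenize the text: lowercase it, strip punctuation, split on whitespace.
with open("test.txt", "r") as f:
    tokens = f.read().lower().translate(str.maketrans('', '', string.punctuation)).split()

# Flatten the CSV of "positive" terms into a set of lowercase words.
with open("positivetest.csv", "r") as f:
    positive_terms = {term.strip().lower() for term in chain.from_iterable(csv.reader(f))}

# Count only the tokens that appear in the positive list.
counts = Counter(token for token in tokens if token in positive_terms)
print("Positive:", sum(counts.values()))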
Look at this, specifically the "train" function:
https://github.com/G3Kappa/Adjustable-Markov-Chains/blob/master/markovchain.py
Also, this link, ignore the JSON stuff at the beginning, the article talks about sentiment analysis:
https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-
Same applies with this link:
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Good luck!
I looked at your code and worked through some of my own as a comparison.
I have 2 ideas for you, based on what I think you may want.
First assumption: you want a basic sentiment count?
Getting to 'textresult' is great. Then you did the same with the positive lexicon, getting to [positivelist], which seemed like the right move. But then you converted [positivelist] into essentially one big sentence.
Would you not just:
1. Pass a 'stop_words' list through [textresult]
2. Merge the two dataframes [textresult (less stop words) and positivelist] on common words - as in an 'inner join' (see the sketch after this list)
3. Then basically do your term frequency
4. It is much easier to aggregate the score then
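Here is a rough, untested sketch of steps 2-4, reusing the question's variable names ('textresult' for the tokenized text, with stop words assumed already removed, and 'newposlist' for the flattened positive-word list):
import pandas as pd

tokens = pd.DataFrame({"word": textresult})
lexicon = pd.DataFrame({"word": list(set(newposlist))})  # deduplicated to avoid double counting

# The 'inner join' keeps only the tokens that appear in the lexicon;
# value_counts() is then the term frequency, and its sum is the aggregate score.
matched = tokens.merge(lexicon, on="word", how="inner")
term_freq = matched["word"].value_counts()
print(term_freq)
print("Positive:", int(term_freq.sum()))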
Second assumption: you are focusing on "excited", "happy", and "optimistic", and you are trying to isolate text themes into those 3 categories?
1. Again, stop at [textresult]
2. Download the 'nrc' and/or 'syuzhet' emotional valence dictionaries; they break emotive words down into 8 emotional groups, so you can subset just the 3 of the 8 emotive groups you want
3. Process it like you did to get [positivelist]
4. Do another join
Sorry, this is a bit hashed up, but if I was anywhere near what you were thinking, let me know and we can make contact.
Second apology: I'm also a novice Python user; I am adapting what I use in R to Python in the above (it's not subtle either :) )

How to make searching a string in text files quicker

I want to search a list of strings (having from 2k up to 10k strings in the list) in thousands of text files (there may be as many as 100k text files, each ranging in size from 1 KB to 100 MB) saved in a folder, and output a csv file of the matched text filenames.
I have developed code that does the required job, but it takes around 8-9 hours for 2000 strings to search in around 2000 text files that total ~2.5 GB in size.
Also, this method consumes a lot of system memory, so I sometimes need to split the 2000 text files into smaller batches for the code to run.
The code is as below (Python 2.7).
# -*- coding: utf-8 -*-
import pandas as pd
import os

def match(searchterm):
    global result
    filenameText = ''
    matchrateText = ''
    for i, content in enumerate(TextContent):
        matchrate = search(searchterm, content)
        if matchrate:
            filenameText += str(listoftxtfiles[i]) + ";"
            matchrateText += str(matchrate) + ";"
    result.append([searchterm, filenameText, matchrateText])

def search(searchterm, content):
    if searchterm.lower() in content.lower():
        return 100
    else:
        return 0

listoftxtfiles = os.listdir("Txt/")
TextContent = []
for txt in listoftxtfiles:
    with open("Txt/" + txt, 'r') as txtfile:
        TextContent.append(txtfile.read())

# The list of strings to search for (sample input shown below).
searchlist = ["Blue Chip", "JP Morgan Global Healthcare", "Maximum Horizon", "1838 Large Cornerstone"]

result = []
for i, searchterm in enumerate(searchlist):
    print("Checking for " + str(i + 1) + " of " + str(len(searchlist)))
    match(searchterm)

df = pd.DataFrame(result, columns=["String", "Filename", "Hit%"])
Sample Input below.
List of strings -
["Blue Chip", "JP Morgan Global Healthcare","Maximum Horizon","1838 Large Cornerstone"]
Text file -
Usual text file containing different lines separated by \n
Sample Output below.
String,Filename,Hit%
JP Morgan Global Healthcare,000032.txt;000031.txt;000029.txt;000015.txt;,100;100;100;100;
Blue Chip,000116.txt;000126.txt;000114.txt;,100;100;100;
1838 Large Cornerstone,NA,NA
Maximum Horizon,000116.txt;000126.txt;000114.txt;,100;100;100;
As in the example above, the first string was matched in 4 files (separated by ;), the second string was matched in 3 files, and the third string was not matched in any of the files.
Is there a quicker way to search without any splitting of text files?
Your code pushes large amounts of data around in memory because it loads all the files into memory and then searches them.
Performance aside, your code could use some cleaning up. Try to write functions as autonomous as possible, without depending on global variables (for input or output).
I rewrote your code using list comprehensions and it became a lot more compact.
# -*- coding: utf-8 -*-
from os import listdir
from os.path import isfile

def search_strings_in_files(path_str, search_list):
    """ Returns a list of lists, where each inner list contains three fields:
    the filename (without path), a string in search_list and the
    frequency (number of occurrences) of that string in that file"""
    filelist = listdir(path_str)
    return [[filename, s, open(path_str + filename, 'r').read().lower().count(s)]
            for filename in filelist
            if isfile(path_str + filename)
            for s in [sl.lower() for sl in search_list]]

if __name__ == '__main__':
    print search_strings_in_files('/some/path/', ['some', 'strings', 'here'])
Mechanisms that I use in this code:
list comprehension to loop through search_list and through the files.
compound statements to loop only through the files in a directory (and not through subdirectories).
method chaining to directly call a method of an object that is returned.
Tip for reading the list comprehension: try reading it from bottom to top, so:
I convert all items in search_list to lower using list comprehension.
Then I loop over that list (for s in...)
Then I filter out the directory entries that are not files using a compound statement (if isfile...)
Then I loop over all files (for filename...)
In the top line, I create the sublist containing three items:
filename
s, that is the lower case search string
a method chained call to open the file, read all its contents, convert it to lowercase and count the number of occurrences of s.
This code uses all the power there is in "standard" Python functions. If you need more performance, you should look into specialised libraries for this task.
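As one illustration of that direction (a hedged sketch, not a drop-in replacement for the code above): combining all the search terms into a single compiled regex alternation lets you scan each file only once instead of once per term.
import os
import re

def search_with_combined_regex(path_str, search_list):
    """Scan each file once, using one regex that matches any of the search terms."""
    pattern = re.compile('|'.join(re.escape(s) for s in search_list), re.IGNORECASE)
    hits = []  # list of [filename, matched term] pairs
    for filename in os.listdir(path_str):
        full_path = os.path.join(path_str, filename)
        if not os.path.isfile(full_path):
            continue
        with open(full_path, 'r') as f:
            content = f.read()
        # Record each distinct term found in this file.
        for term in set(m.group(0).lower() for m in pattern.finditer(content)):
            hits.append([filename, term])
    return hits

print(search_with_combined_regex('/some/path/', ['some', 'strings', 'here']))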
