How to parse a value next to a specific substring in Python - python

I have a log file containing lines formatted as shown below. I want to parse the values right next to the substrings element=(string), time=(guint64) and ts=(guint64) and save them to a list that will contain separate lists for each line:
0:00:00.336212023 62327 0x55f5ca5174a0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca532a60, element=(string)rawvideoparse0, src=(string)src, time=(guint64)852315, ts=(guint64)336203035;
0:00:00.336866520 62327 0x55f5ca5176d0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca53f860, element=(string)nvh264enc0, src=(string)src, time=(guint64)6403181, ts=(guint64)336845676;
The final output would then look like: [['rawvideoparse0', 852315, 336203035], ['nvh264enc0', 6403181, 336845676]].
I should probably use Python's string split or partition methods to obtain the relevant parts in each line but I can't come up with a short solution that can be generalised for the values that I'm searching for. I also don't know how to deal with the fact that the values element and time are terminated with a comma whereas ts is terminated with a semicolon (without writing separate conditional for the two cases). How can I achieve this using the string manipulation methods in Python?

Regex was meant for this:
lines = """
0:00:00.336212023 62327 0x55f5ca5174a0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca532a60, element=(string)rawvideoparse0, src=(string)src, time=(guint64)852315, ts=(guint64)336203035;
0:00:00.336866520 62327 0x55f5ca5176d0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca53f860, element=(string)nvh264enc0, src=(string)src, time=(guint64)6403181, ts=(guint64)336845676;
"""
import re
pattern = re.compile(".*element-id=\\(string\\)(?P<elt_id>.*), element=\\(string\\)(?P<elt>.*), src=\\(string\\)(?P<src>.*), time=\\(guint64\\)(?P<time>.*), ts=\\(guint64\\)(?P<ts>.*);")
for l in lines.splitlines():
    match = pattern.match(l)
    if match:
        results = match.groupdict()
        print(results)
This yields the following dictionaries (notice that the captured groups have been named in the regex above using (?P<name>...), which is why the keys have these names):
{'elt_id': '0x55f5ca532a60', 'elt': 'rawvideoparse0', 'src': 'src', 'time': '852315', 'ts': '336203035'}
{'elt_id': '0x55f5ca53f860', 'elt': 'nvh264enc0', 'src': 'src', 'time': '6403181', 'ts': '336845676'}
You can make this regex pattern even more generic, since all the elements share a common structure <name>=(<type>)<value>:
pattern2 = re.compile(r"(?P<name>[^,;\s]*)=\((?P<type>[^,;]*)\)(?P<value>[^,;]*)")
for l in lines.splitlines():
    all_interesting_items = pattern2.findall(l)
    print(all_interesting_items)
it yields:
[]
[('element-id', 'string', '0x55f5ca532a60'), ('element', 'string', 'rawvideoparse0'), ('src', 'string', 'src'), ('time', 'guint64', '852315'), ('ts', 'guint64', '336203035')]
[('element-id', 'string', '0x55f5ca53f860'), ('element', 'string', 'nvh264enc0'), ('src', 'string', 'src'), ('time', 'guint64', '6403181'), ('ts', 'guint64', '336845676')]
Note that in all cases, https://regex101.com/ is your friend for debugging regex :)
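To get from these tuples to the list-of-lists format requested in the question, the matches can be filtered and converted. The following is a minimal sketch, not a definitive implementation: the names `field_pattern`, `wanted` and `extract` are made up for illustration, and it assumes every `guint64` value is a plain decimal integer.

```python
import re

# Two sample lines copied from the question.
log = """\
0:00:00.336212023 62327 0x55f5ca5174a0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca532a60, element=(string)rawvideoparse0, src=(string)src, time=(guint64)852315, ts=(guint64)336203035;
0:00:00.336866520 62327 0x55f5ca5176d0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca53f860, element=(string)nvh264enc0, src=(string)src, time=(guint64)6403181, ts=(guint64)336845676;
"""

# Same generic <name>=(<type>)<value> pattern as above, written as a raw string.
field_pattern = re.compile(r"(?P<name>[^,;\s]*)=\((?P<type>[^,;]*)\)(?P<value>[^,;]*)")
wanted = ("element", "time", "ts")  # hypothetical name: fields to keep, in output order


def extract(line):
    """Return [element, time, ts] for one log line, or None if a field is missing."""
    fields = {}
    for name, type_, value in field_pattern.findall(line):
        # Assumption: guint64 values are decimal integers; everything else stays a string.
        fields[name] = int(value) if type_ == "guint64" else value
    if all(name in fields for name in wanted):
        return [fields[name] for name in wanted]
    return None


rows = [row for row in map(extract, log.splitlines()) if row is not None]
print(rows)
```

Lines that lack any of the three fields are skipped rather than raising an error, which may or may not be what you want for a real log.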

Here is a possible solution using a series of split commands:
output = []
with open("log.txt") as f:
    for line in f:
        values = []
        line = line.split("element=(string)", 1)[1]
        values.append(line.split(",", 1)[0])
        line = line.split("time=(guint64)", 1)[1]
        values.append(int(line.split(",", 1)[0]))
        line = line.split("ts=(guint64)", 1)[1]
        values.append(int(line.split(";", 1)[0]))
        output.append(values)
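The question also asks how to avoid a separate case for the comma and semicolon terminators while staying with string methods. One possible sketch is below; the helper name `value_after` and the `markers` list are hypothetical, and it assumes the values themselves never contain a comma or semicolon.

```python
def value_after(line, marker):
    """Return the text that follows marker, up to the next ',' or ';'."""
    _, found, rest = line.partition(marker)
    if not found:
        raise ValueError(f"{marker!r} not found in line")
    # Treat ';' like ',' so that a single split handles both terminators.
    return rest.replace(";", ",").split(",", 1)[0]


# Each marker is paired with the conversion to apply to its value.
markers = [("element=(string)", str), ("time=(guint64)", int), ("ts=(guint64)", int)]

# Sample line copied from the question.
line = ("0:00:00.336212023 62327 0x55f5ca5174a0 TRACE GST_TRACER :0:: element-latency, "
        "element-id=(string)0x55f5ca532a60, element=(string)rawvideoparse0, "
        "src=(string)src, time=(guint64)852315, ts=(guint64)336203035;")

row = [convert(value_after(line, marker)) for marker, convert in markers]
print(row)
```

Adding another field then only means adding one more entry to `markers`.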

This is not the fastest solution, but this is probably how I would code it for readability.
# create empty list for output
list_final_output = []
# filter substrings
list_filter = ['element=(string)', 'time=(guint64)', 'ts=(guint64)']
# open the log file and read in the lines as a list of strings
with open('so_58272709.log', 'r') as f_log:
    string_example = f_log.read().splitlines()
print(f'string_example: \n{string_example}\n')
# loop through each line in the list of strings
for each_line in string_example:
    # split each line by comma
    list_split_line = each_line.split(',')
    # loop through each filter substring, include filter
    filter_string = [x for x in list_split_line if (list_filter[0] in x
                                                    or list_filter[1] in x
                                                    or list_filter[2] in x
                                                    )]
    # remove the substring
    filter_string = [x.replace(list_filter[0], '') for x in filter_string]
    filter_string = [x.replace(list_filter[1], '') for x in filter_string]
    filter_string = [x.replace(list_filter[2], '') for x in filter_string]
    # store results of each for-loop
    list_final_output.append(filter_string)
# print final output
print(f'list_final_output: \n{list_final_output}\n')

Related

Extract time values from a list and add to a new list or array

I have a script that reads through a log file that contains hundreds of these logs, and looks for the ones that have a "On, Off, or Switch" type. Then I output each log into its own list. I'm trying to find a way to extract the Out and In times into a separate list/array and then subtract the two times to find the duration of each separate log. This is what the outputted logs look like:
['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
This is my current code:
logfile = '/path/to/my/logfile'
with open(logfile, 'r') as f:
    text = f.read()
words = ["On", "Off", "Switch"]
text2 = text.split('\n')
for l in text.split('\n'):
    if (words[0] in l or words[1] in l or words[2] in l):
        log = l.split(',')[0:3]
I'm stuck on how to target only the Out and In time values from the logs and put them in an array and convert to a time value to find duration.
Initial log before script: everything after the "In" time is useless for what I'm looking for so I only have the first three indices outputted
2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a","Type":"Switch,"In":"2020-01-31T00:30:20.140Z","Path":"interface","message":"interface changed status from unknown to normal","severity":"INFORMATIONAL","display":true,"json_map":"{\"severity\":null,\"eventId\":\"65e-64d9-45-ab62-8ef98ac5e60d\",\"componentPath\":\"interface_css\",\"displayToGui\":false,\"originalState\":\"unknown\",\"closed\":false,\"eventType\":\"InterfaceStateChange\",\"time\":\"2019-04-18T07:04:32.747Z\",\"json_map\":null,\"message\":\"interface_css changed status from unknown to normal\",\"newState\":\"normal\",\"info\":\"Event created with current status\"}","closed":false,"info":"Event created with current status","originalState":"unknown","newState":"normal"}
Below is a possible solution. The wordmatch line is a bit of a hack, until I find something clearer: it's just a one-liner that creates either an empty set or a 1-element set containing True if one of the words matches.
(Untested)
import re
logfile = '/path/to/my/logfile'
words = ["On", "Off", "Switch"]
dateformat = r'\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[Zz]?'
pattern = fr'Out:\s*\[(?P<out>{dateformat})\].*In":\s*\"(?P<in>{dateformat})\"'
regex = re.compile(pattern)
with open(logfile, 'r') as f:
    for line in f:
        wordmatch = set(filter(None, (word in line for word in words)))
        if wordmatch:
            match = regex.search(line)
            if match:
                intime = match.group('in')
                outtime = match.group('out')
                # whatever to store these strings, e.g., append to list or insert in a dict.
As noted, your log example is very awkward, so this works for the example line, but may not work for every line. Adjust as necessary.
I have also not included (if so wanted), a conversion to a datetime.datetime object. For that, read through the datetime module documentation, in particular datetime.strptime. (Alternatively, you may want to store your results in a Pandas table. In that case, read through the Pandas documentation on how to convert strings to actual datetime objects.)
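As a rough illustration of that conversion, here is a sketch using datetime.strptime on the two timestamps from the example line. The format string is an assumption: it expects the trailing Z to be present and treats it as a literal character, so the resulting datetime objects are naive (no timezone attached).

```python
from datetime import datetime

# Assumed format: ISO-like timestamp with fractional seconds and a literal trailing 'Z'.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"

# Values taken from the example log line in the question.
outtime = "2020-01-31T00:30:20.150Z"
intime = "2020-01-31T00:30:20.140Z"

out_dt = datetime.strptime(outtime, TIME_FORMAT)
in_dt = datetime.strptime(intime, TIME_FORMAT)

# Subtracting two datetime objects gives a timedelta.
duration = out_dt - in_dt
print(duration.total_seconds())
```

If some lines lack the trailing Z (the regex above allows that), the format string would need adjusting for those lines.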
You also don't need to read and split on newlines yourself: for line in f will do that for you (provided f is indeed a filehandle).
Regex is probably the way to go (speed, efficiency etc.) ... but ...
You could take a very simplistic (if very inefficient) approach of cleaning your data:
join all of it into a string
replace things that hinder easy parsing
split wisely and filter the split
like so:
data = ['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
all_text = " ".join(data)
# this is inefficient and will create throwaway intermediate strings - if you are
# in a hurry or operate on 100s of MB of data, this is NOT the way to go, unless
# you have time
# iterate pairs of ("bad thing", "what to replace it with") (or list of bad things)
for thing in [ (": ",":"), (list('[]{}"'),"") ]:
    whatt = thing[0]
    withh = thing[1]
    # if list, do so for each bad thing
    if isinstance(whatt, list):
        for p in whatt:
            # replace it
            all_text = all_text.replace(p,withh)
    else:
        all_text = all_text.replace(whatt,withh)
# format is now far better suited to splitting/filtering
cleaned = [a for a in all_text.split(" ")
           if any(a.startswith(prefix) or "Switch" in a
                  for prefix in {"In:","Switch:","Out:"})]
print(cleaned)
Outputs:
['Out:2020-01-31T00:30:20.150Z', 'Type:Switch', 'In:2020-01-31T00:30:20.140Z']
After cleaning your data would look like:
2020-01-31T12:04:57.976Z 1234 Out:2020-01-31T00:30:20.150Z Id:Id:4-f-4-9-6a Type:Switch In:2020-01-31T00:30:20.140Z
You can transform the clean list into a dictionary for ease of lookup:
d = dict( part.split(":",1) for part in cleaned)
print(d)
will produce:
{'In': '2020-01-31T00:30:20.140Z',
'Type': 'Switch',
'Out': '2020-01-31T00:30:20.150Z'}
You can use the datetime module to parse the times from your values.

Python regular expression to split parameterized text file

I'm trying to split a file that contains 'string = float' format repeatedly.
Below is how the file looks like.
+name1 = 32 name2= 4
+name3 = 2 name4 = 5
+name5 = 2e+23
...
I want to put them into a dictionary.
Like...
a={name1:32, name2:4, name3:2, name4:5, name5:2e+23}
I'm new to regular expression and having a hard time figuring out what to do.
After some googling, I tried to do as below to remove "+" character and white space..
p=re.compile('[^+\s]+')
splitted_list=p.findall(lineof_file)
But this gave me two problems..
1. When there is no space between the name and the "=" sign, it doesn't split.
2. For numbers like 2e+23, it splits at the + sign in between.
I managed to parse the file as I wanted after some modification of depperm's code.
But I'm facing another problem.
To better explain my problems. Below is how my file can look like.
After + sign multiple parameter and value pair can appear with '=' sign.
The parameter name can contain letters and digits in any position. The value can also contain a +/- sign and use scientific notation (E/e followed by +/-). Sometimes the value is a math expression, in which case it is single-quoted.
+ abc2dfg3 = -2.3534E-03 dfe4c3= 2.000
+ abcdefg= '1.00232e-1*x' * bdfd=1e-3
I managed to parse the above using the below regex.
re.findall("(\w+)\s*=\s*([+-]?[\d+.Ee+-]+|'[^']+')",eachline)
But now my problem is that there can sometimes be a comment, like "* bdfd=1e-3". Anything after an * (asterisk) in my file should be treated as a comment, unless the * appears inside a single-quoted string.
With above regex, it parses "bdfd=1e-3" as well but I want it to be not parsed.
I tried to find solution for hours but I couldn't find any solution so far.
I would suggest just grabbing the name and the value instead of worrying about the spaces or unwanted characters. I'd use this regex: (name\d+)\s?=\s?([\de+]+) which will get the name and then you also group the number even if it has an e or space.
import re
p=re.compile(r'(name\d+)\s*=\s*([\de+]+)')
a ={}
with open("file.txt", "r") as ins:
    for line in ins:
        splitted_list=p.findall(line)
        #splitted_list looks like: [('name1', '32'), ('name2', '4')]
        for group in splitted_list:
            a[group[0]]=group[1]
print(a)
#{'name1': '32', 'name2': '4', 'name3': '2', 'name4': '5', 'name5': '2e+23'}
You don't need a regular expression to accomplish your goal. You can use built-in Python methods.
your_dictionary = {}
# Read the file
with open('file.txt','r') as fin:
    lines = fin.readlines()
# iterate over each line
for line in lines:
    # NOTE: this simple split assumes a single 'name = value' pair per line
    key, sep, value = line.lstrip('+').partition('=')
    if sep:  # skip lines that contain no '='
        your_dictionary[key.strip()] = value.strip()
print(your_dictionary)
Hope it helps!
You can combine regex with string splitting:
Create the file:
t ="""
+name1 = 32 name2= 4
+name3 = 2 name4 = 5
+name5 = 2e+23"""
fn = "t.txt"
with open(fn,"w") as f:
    f.write(t)
Split the file:
import re
d = {}
with open(fn,"r") as f:
    for line in f: # process each line
        g = re.findall(r'(\w+ ?= ?[^ ]*)',line) # find all name = something
        for hit in g: # something != space
            hit = hit.strip() # remove spaces
            if hit:
                key, val = hit.split("=") # split and strip and convert
                d[key.rstrip()] = float(val.strip()) # put into dict
print(d)
Output:
{'name1': 32.0, 'name2': 4.0, 'name3': 2.0, 'name4': 5.0, 'name5': 2e+23}
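None of the answers above covers the follow-up in the question, where everything after an unquoted * should be ignored as a comment. One possible approach, sketched below, is to cut the line at the first asterisk that is not inside single quotes and then apply the asker's own findall pattern to what remains. The names `before_comment`, `pair_pattern` and `parse_line` are made up for this sketch, and it assumes quotes are always balanced within a line.

```python
import re

# Matches everything up to (not including) the first '*' outside single quotes:
# the text is consumed either as a whole quoted string or one non-'*', non-quote character.
before_comment = re.compile(r"(?:'[^']*'|[^*'])*")

# The pattern from the question: name = number or single-quoted expression.
pair_pattern = re.compile(r"(\w+)\s*=\s*([+-]?[\d+.Ee+-]+|'[^']+')")


def parse_line(line):
    """Return the (name, value) pairs on a line, ignoring any trailing * comment."""
    code = before_comment.match(line).group(0)
    return pair_pattern.findall(code)


# Sample lines copied from the question.
print(parse_line("+ abc2dfg3 = -2.3534E-03 dfe4c3= 2.000"))
print(parse_line("+ abcdefg= '1.00232e-1*x' * bdfd=1e-3"))
```

The values are returned as strings (quoted expressions keep their quotes), so converting the numeric ones to float is left as a separate step.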

write a whole line in a .txt file if not in a .yaml file

I am trying to write to a text file (download.txt) the lines from open.txt whose 'id' is not repeated and that are not in the exceptions (idexception, classexcepcion). I have managed to write the non-repeated ids and to apply idexception.
MY QUESTION is how to add the 'classexception' condition. I tried, but could not get it to work. Any idea which dictionaries/conditionals I should use?
c = open('open.txt','r') #structure: name:xxx; id:xxxx; class:xxxx; name:xxx; id:xxxx;class:xxxx etc
t=c.read()
d=open('download.txt','a')
allLines = t.split("\n")
lines = {}
class=[s[10:-1] for s in t.split() if s.startswith("class")]
for line in allLines:
    idPos = line.find("id:")
    colPos = line.find(";",idPos)
    if idPos > -1:
        id = line[idPos+4: colPos if colPos > -1 else None]
        if id not in idexception:
            lines.setdefault(id,line)
for l in lines:
    d.write(lines[l]+'\n')
c.close()
d.close()
The question is quite unclear, but if I understand it correctly, here is my approach to your problem, with a lot of comments inside:
import re
id_exceptions = ['id_ex_1', 'id_ex_2']
class_exceptions = ['class_ex_1', 'class_ex_2']
# Values to be written to download.txt file
# Since ids need to be unique, the structure of this dict should be like this:
# {[single key as value of an id]: {name: xxx, class: xxx}}
unique_values = dict()
# All files should be opened using 'with' statement
with open('open.txt') as source:
    # Read whole file into one single long string
    all_lines = source.read().replace('\n', '')
# Prepare regular expression for getting values from: name, id and class as a dict
# Read https://regex101.com/r/Kby3fY/1 for extra explanation of what it does
# (Python spells named groups (?P<name>...), not (?<name>...))
reg_exp = re.compile(r'name:(?P<name>[a-zA-Z0-9_-]*);\sid:(?P<id>[a-zA-Z0-9_-]*);\sclass:(?P<class>[a-zA-Z0-9_-]*);')
# Read single long string and match to above regular expression
for match in reg_exp.finditer(all_lines):
    # This will produce a single dict {'name': xxx, 'id': xxx, 'class': xxx}
    single_line = match.groupdict()
    # Now we will check against all conditions at once and
    # only if all of them hold will we add the values under a unique id
    if (single_line['id'] not in unique_values  # Check it is not present already
            and single_line['id'] not in id_exceptions  # Check it is not in id exceptions
            and single_line['class'] not in class_exceptions):  # Check it is not in class exceptions
        # Add unique id values
        unique_values[single_line['id']] = {'name': single_line['name'],
                                            'class': single_line['class']}
# Now we just need to write it to download.txt file
with open('download.txt', 'w') as destination:
    for key, value in unique_values.items():  # In Python 2.x use unique_values.iteritems()
        line = "id:{}; name:{}; class:{}".format(key, value['name'], value['class'])
        destination.write(line + '\n')

Reading in data from file using regex in python

I have a data file with tons of data like:
{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}
I want to read in the data and save it in a list. I am having trouble getting exactly the right code to extract the data between the { }. I don't want the quotes and the ` after the numbers. Also, the data is not separated by line, so how do I tell re.search where to begin looking for the next set of data?
At first glance, you can break this data into chunks by splitting it on the string },{:
chunks = data.split('},{')
chunks[0] = chunks[0][1:] # first chunk started with '{'
chunks[-1] = chunks[-1][:-1] # last chunk ended with '}'
Now you have chunks like
"Passenger Quarters",27.`,"Cardassian","not injured"
and you can apply a regular expression to them.
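A minimal sketch of that second step is shown below. The pattern and the names `field_pattern` and `rows` are assumptions made for illustration, not part of the answer above; it treats a field as either a quoted string or a run of digits, and simply ignores the ".`" that follows each number.

```python
import re

# Shortened sample of the data from the question.
data = '{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Bridge","Unknown","Romulan","not injured"}'

chunks = data.split('},{')
chunks[0] = chunks[0][1:]     # first chunk started with '{'
chunks[-1] = chunks[-1][:-1]  # last chunk ended with '}'

# A field is either a double-quoted string (group 1) or a run of digits (group 2).
field_pattern = re.compile(r'"([^"]*)"|(\d+)')

rows = []
for chunk in chunks:
    row = []
    for text, number in field_pattern.findall(chunk):
        # Exactly one of the two groups is non-empty for a numeric field.
        row.append(int(number) if number else text)
    rows.append(row)
print(rows)
```

Because the quoted alternative is tried first, digits that appear inside a quoted string stay part of that string.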
You should do this in two passes. One to get the list of items and one to get the contents of each item:
import re
from pprint import pprint
data = '{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Passenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}'
# This splits up the data into items where each item is the
# contents inside a pair of braces
item_pattern = re.compile("{([^}]+)}")
# This splits up each item into its parts. Either matching a string
# inside quotation marks or a number followed by some garbage
contents_pattern = re.compile('(?:"([^"]+)"|([0-9]+)[^,]+),?')
rows = []
for item in item_pattern.findall(data):
    row = []
    for content in contents_pattern.findall(item):
        if content[1]: # Number matched, treat it as one
            row.append(int(content[1]))
        else: # Number not matched, use the string (even if empty)
            row.append(content[0])
    rows.append(row)
pprint(rows)
The following will produce a list of lists, where each list is an individual record.
import re
data = '{"Passenger Quarters",27.`,"Cardassian","not injured"},{"Passenger Quarters",9.`,"Cardassian","injured"},{"Pssenger Quarters",32.`,"Romulan","not injured"},{"Bridge","Unknown","Romulan","not injured"}'
# remove characters we don't want and split into individual fields
badchars = ['{','}','`','.','"']
# Python 3: build a deletion table with str.maketrans
# (the Python 2 form was data.translate(None, ''.join(badchars)))
newdata = data.translate(str.maketrans('', '', ''.join(badchars)))
fields = newdata.split(',')
# Assemble groups of 4 fields into separate lists and append
# to the parent list. Obvious weakness here is if there are
# records that contain something other than 4 fields
records = []
myrecord = []
recordcount = 1
for field in fields:
    myrecord.append(field)
    recordcount = recordcount + 1
    if (recordcount > 4):
        records.append(myrecord)
        myrecord = []
        recordcount = 1
for record in records:
    print(record)
Output:
['Passenger Quarters', '27', 'Cardassian', 'not injured']
['Passenger Quarters', '9', 'Cardassian', 'injured']
['Pssenger Quarters', '32', 'Romulan', 'not injured']
['Bridge', 'Unknown', 'Romulan', 'not injured']

How to use key/value pairs in a Python dictionary

I wrote a simple code to read a text file. Here's a snippet:
linestring = open(wFile, 'r').read()
# Split on line Feeds
lines = linestring.split('\n')
num = len(lines)
print num
numHeaders = 18
proc = lines[0]
header = {}
for line in lines[1:18]:
    keyVal = line.split('=')
    header[keyVal[0]] = keyVal[1]
    # note that the first member is {'Mode', '5'}
    print header[keyVal[0]] # this prints the number '5' correctly
    print header['Mode'] # this fails
This last print statement creates the runtime error:
print header['Mode']
KeyError: 'Mode'
The first print statement print header[keyVal[0]] works fine but the second fails!!! keyVal[0] IS the string literal 'Mode'
Why does using the string 'Mode' directly fail?
split() with no arguments will split on all consecutive whitespace, so
'foo bar'.split()
is ['foo', 'bar'].
But if you give it an argument, it no longer removes whitespace for you, so
'foo = bar'.split('=')
is ['foo ', ' bar'].
You need to clean up the whitespace yourself. One way to do that is using a list comprehension:
[s.strip() for s in orig_string.split('=')]
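Applied to the header loop from the question, that could look like the sketch below. The sample `lines` are invented for illustration (the real file contents are not shown in the question), and str.partition is used instead of split so that a value containing a second "=" stays intact.

```python
# Hypothetical header lines with stray whitespace around the '=' sign.
lines = ["Mode = 5", "Rate=100 ", " Name = probe A"]

header = {}
for line in lines:
    key, sep, value = line.partition("=")
    if not sep:
        continue  # no '=' on this line, so there is nothing to store
    # Strip both sides so the key is exactly 'Mode', not 'Mode '.
    header[key.strip()] = value.strip()

print(header["Mode"])
```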
keyVal = list(map(str.strip, line.split('=')))  # this will remove extra whitespace; list() is needed in Python 3, where map returns an iterator
You have whitespace problems.
