I wrote some simple code to read a text file. Here's a snippet:
linestring = open(wFile, 'r').read()
# Split on line Feeds
lines = linestring.split('\n')
num = len(lines)
print num
numHeaders = 18
proc = lines[0]
header = {}
for line in lines[1:18]:
    keyVal = line.split('=')
    header[keyVal[0]] = keyVal[1]
    # note that the first member is {'Mode', '5'}
    print header[keyVal[0]] # this prints the number '5' correctly
    print header['Mode'] # this fails
This last print statement creates the runtime error:
print header['Mode']
KeyError: 'Mode'
The first print statement, print header[keyVal[0]], works fine, but the second fails! keyVal[0] IS the string literal 'Mode'.
Why does using the string 'Mode' directly fail?
split() with no arguments will split on all consecutive whitespace, so
'foo bar'.split()
is ['foo', 'bar'].
But if you give it an argument, it no longer removes whitespace for you, so
'foo = bar'.split('=')
is ['foo ', ' bar'].
You need to clean up the whitespace yourself. One way to do that is using a list comprehension:
[s.strip() for s in orig_string.split('=')]
You have a whitespace problem: the keys still carry the spaces around the '='. Strip each part after splitting:
keyVal = map(str.strip, line.split('=')) # this will remove extra whitespace
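To make the whitespace issue concrete, here is a minimal sketch that builds the header dictionary with both the key and the value stripped. The function name parse_header and the sample lines are made up for illustration, and the file is assumed to hold one key = value pair per line.

```python
def parse_header(lines):
    """Build a dict from 'key = value' lines, stripping both parts."""
    header = {}
    for line in lines:
        # partition() splits on the first '=' only, so a value may itself contain '='
        key, sep, value = line.partition('=')
        if not sep:
            continue  # no '=' on this line, so it is not a header entry
        header[key.strip()] = value.strip()
    return header


# Hypothetical sample lines, with the stray spaces that caused the KeyError
sample = ["Mode = 5", "Name=probe A ", "  Gain =  1.5"]
header = parse_header(sample)
print(header["Mode"])  # the key is now exactly 'Mode'
```

Using partition() rather than split('=') is a design choice here: it never raises an IndexError on a line without '=', and it leaves any later '=' inside the value untouched.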
When I write lst=list(map(int,input().split().strip())), I get AttributeError: 'list' object has no attribute 'strip'.
But it works when I remove the strip() call.
My question: a list object has no attribute split either. So why does lst=list(map(int,input().split())) not raise an error, while the version with strip() does?
Before you read the rest of the answer: you shouldn't have to strip() after you call split() because split() will consider multiple whitespace characters as a single delimiter and automatically remove the extra whitespace. For example, this snippet evaluates to True:
s1 = "1 2 3"
s2 = "1  2   3"
s3 = "  1 2 3  "
s1.split() == s2.split() == s3.split()
split() and strip() are both attributes of string objects!
When you're confused by code that's been stuffed into one line, it often helps to unravel it over multiple lines to see what it's doing.
Your line of code can be unraveled like so:
user_input_str = input()
split_input_list = user_input_str.split()
stripped_input = split_input_list.strip() ### ERROR!!!
lst = list(map(int, stripped_input))
Clearly, you tried to access the strip() method of a list object, and you know that doesn't exist.
In your second example, you do
user_input_str = input()
split_input_list = user_input_str.split()
lst = list(map(int, split_input_list))
This works perfectly fine because you never call strip() on a list object.
Now to fix this, you need to change the order of operations: first, you get your input. Next, strip it. This gives you back a string. Then, split this stripped string.
user_input_str = input()
stripped_input_str = user_input_str.strip() ### No error now!
split_input_list = stripped_input_str.split()
lst = list(map(int, split_input_list))
#or in one line:
lst = list(map(int, input().strip().split()))
Or, if you want to strip each element of the split input list, you will need to map the strip() function to split_input_list like so:
user_input_str = input()
split_input_list = user_input_str.split()
stripped_input_list = list(map(str.strip, split_input_list))
lst = list(map(int, stripped_input_list))
#or in one line
lst = list(map(int, map(str.strip, input().split())))
# or, create a function that calls strip and then converts to int, and map to it
def stripint(value):
    return int(value.strip())
lst = list(map(stripint, input().split()))
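As a quick check of the point about split(), the sketch below uses a literal string in place of input(); parse_ints is a hypothetical helper name. It suggests that stripping before or after a whitespace split makes no difference to the result.

```python
def parse_ints(text):
    """Split on runs of whitespace and convert each token to int."""
    return [int(token) for token in text.split()]


raw = "  1 2   3 \n"  # stands in for what input() might return

plain = parse_ints(raw)                                      # no strip at all
strip_first = parse_ints(raw.strip())                        # strip, then split
strip_each = [int(token.strip()) for token in raw.split()]   # split, then strip each

print(plain == strip_first == strip_each)  # all three agree
```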
I have a log file containing lines formatted as shown below. I want to parse the values right next to the substrings element=(string), time=(guint64) and ts=(guint64) and save them to a list that will contain separate lists for each line:
0:00:00.336212023 62327 0x55f5ca5174a0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca532a60, element=(string)rawvideoparse0, src=(string)src, time=(guint64)852315, ts=(guint64)336203035;
0:00:00.336866520 62327 0x55f5ca5176d0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca53f860, element=(string)nvh264enc0, src=(string)src, time=(guint64)6403181, ts=(guint64)336845676;
The final output would then look like: [['rawvideoparse0', 852315, 336203035], ['nvh264enc0', 6403181, 336845676]].
I should probably use Python's string split or partition methods to obtain the relevant parts in each line but I can't come up with a short solution that can be generalised for the values that I'm searching for. I also don't know how to deal with the fact that the values element and time are terminated with a comma whereas ts is terminated with a semicolon (without writing separate conditional for the two cases). How can I achieve this using the string manipulation methods in Python?
Regex was meant for this:
lines = """
0:00:00.336212023 62327 0x55f5ca5174a0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca532a60, element=(string)rawvideoparse0, src=(string)src, time=(guint64)852315, ts=(guint64)336203035;
0:00:00.336866520 62327 0x55f5ca5176d0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca53f860, element=(string)nvh264enc0, src=(string)src, time=(guint64)6403181, ts=(guint64)336845676;
"""
import re
pattern = re.compile(".*element-id=\\(string\\)(?P<elt_id>.*), element=\\(string\\)(?P<elt>.*), src=\\(string\\)(?P<src>.*), time=\\(guint64\\)(?P<time>.*), ts=\\(guint64\\)(?P<ts>.*);")
for l in lines.splitlines():
    match = pattern.match(l)
    if match:
        results = match.groupdict()
        print(results)
yields the following dictionaries (notice that the captured groups have been named in the regex above using (?P<name>...), that's why we have these names):
{'elt_id': '0x55f5ca532a60', 'elt': 'rawvideoparse0', 'src': 'src', 'time': '852315', 'ts': '336203035'}
{'elt_id': '0x55f5ca53f860', 'elt': 'nvh264enc0', 'src': 'src', 'time': '6403181', 'ts': '336845676'}
You can make this regex pattern even more generic, since all the elements share a common structure <name>=(<type>)<value>:
pattern2 = re.compile(r"(?P<name>[^,;\s]*)=\((?P<type>[^,;]*)\)(?P<value>[^,;]*)")
for l in lines.splitlines():
    all_interesting_items = pattern2.findall(l)
    print(all_interesting_items)
it yields:
[]
[('element-id', 'string', '0x55f5ca532a60'), ('element', 'string', 'rawvideoparse0'), ('src', 'string', 'src'), ('time', 'guint64', '852315'), ('ts', 'guint64', '336203035')]
[('element-id', 'string', '0x55f5ca53f860'), ('element', 'string', 'nvh264enc0'), ('src', 'string', 'src'), ('time', 'guint64', '6403181'), ('ts', 'guint64', '336845676')]
Note that in all cases, https://regex101.com/ is your friend for debugging regex :)
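Neither snippet above produces the list-of-lists layout asked for in the question, so here is one possible sketch that gets there from the generic pattern. The names FIELD and extract_row are illustrative, and converting guint64 values to int is an assumption based on the desired output shown in the question.

```python
import re

# Same idea as the generic pattern: <name>=(<type>)<value>, ended by ',' or ';'
FIELD = re.compile(r"(?P<name>[^,;\s]+)=\((?P<type>[^,;()]+)\)(?P<value>[^,;]*)")


def extract_row(line, wanted=("element", "time", "ts")):
    """Return the wanted values for one log line, or None if any is missing."""
    fields = {name: (type_, value) for name, type_, value in FIELD.findall(line)}
    if any(name not in fields for name in wanted):
        return None
    row = []
    for name in wanted:
        type_, value = fields[name]
        # Assumption: guint64 values should become Python ints
        row.append(int(value) if type_ == "guint64" else value)
    return row


log = """
0:00:00.336212023 62327 0x55f5ca5174a0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca532a60, element=(string)rawvideoparse0, src=(string)src, time=(guint64)852315, ts=(guint64)336203035;
0:00:00.336866520 62327 0x55f5ca5176d0 TRACE GST_TRACER :0:: element-latency, element-id=(string)0x55f5ca53f860, element=(string)nvh264enc0, src=(string)src, time=(guint64)6403181, ts=(guint64)336845676;
"""

rows = [row for row in map(extract_row, log.splitlines()) if row is not None]
print(rows)
```

Because the terminator is excluded by the character class [^,;], the comma and semicolon cases need no separate handling.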
Here is a possible solution using a series of split commands:
output = []
with open("log.txt") as f:
    for line in f:
        values = []
        line = line.split("element=(string)", 1)[1]
        values.append(line.split(",", 1)[0])
        line = line.split("time=(guint64)", 1)[1]
        values.append(int(line.split(",", 1)[0]))
        line = line.split("ts=(guint64)", 1)[1]
        values.append(int(line.split(";", 1)[0]))
        output.append(values)
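The repeated split calls above could be folded into one small helper, which also sidesteps the comma-versus-semicolon question. This is only a sketch: value_after is a hypothetical name, and it assumes a value never contains a comma or semicolon itself.

```python
def value_after(line, marker, terminators=",;"):
    """Return the text following marker, up to the first terminator character."""
    _, found, rest = line.partition(marker)
    if not found:
        return None
    # Stop at whichever terminator appears first; keep everything if none does
    cut = len(rest)
    for char in terminators:
        pos = rest.find(char)
        if pos != -1:
            cut = min(cut, pos)
    return rest[:cut].strip()


line = ("0:00:00.336866520 62327 0x55f5ca5176d0 TRACE GST_TRACER :0:: element-latency, "
        "element-id=(string)0x55f5ca53f860, element=(string)nvh264enc0, src=(string)src, "
        "time=(guint64)6403181, ts=(guint64)336845676;")

row = [
    value_after(line, "element=(string)"),
    int(value_after(line, "time=(guint64)")),
    int(value_after(line, "ts=(guint64)")),
]
print(row)
```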
This is not the fastest solution, but this is probably how I would code it for readability.
# create empty list for output
list_final_output = []
# filter substrings
list_filter = ['element=(string)', 'time=(guint64)', 'ts=(guint64)']
# open the log file and read in the lines as a list of strings
with open('so_58272709.log', 'r') as f_log:
    string_example = f_log.read().splitlines()
print(f'string_example: \n{string_example}\n')
# loop through each line in the list of strings
for each_line in string_example:
    # split each line by comma
    list_split_line = each_line.split(',')
    # loop through each filter substring, include filter
    filter_string = [x for x in list_split_line if (list_filter[0] in x
                                                    or list_filter[1] in x
                                                    or list_filter[2] in x
                                                    )]
    # remove the substring
    filter_string = [x.replace(list_filter[0], '') for x in filter_string]
    filter_string = [x.replace(list_filter[1], '') for x in filter_string]
    filter_string = [x.replace(list_filter[2], '') for x in filter_string]
    # store results of each for-loop
    list_final_output.append(filter_string)
# print final output
print(f'list_final_output: \n{list_final_output}\n')
I am learning Python and am struggling with finding an exact word in each string in a list of strings.
Apologies if this is an already asked question for this situation.
This is what my code looks like so far:
with open('text.txt') as f:
    lines = f.readlines()
lines = [line.rstrip('\n') for line in open('text.txt')]
keyword = input("Enter a keyword: ")
matching = [x for x in lines if keyword.lower() in x.lower()]
match_count = len(matching)
print('\nNumber of matches: ', match_count, '\n')
print(*matching, sep='\n')
Right now, matching will return all strings containing the word, not just strings containing the exact word. For example, if I enter 'local' as the keyword, strings with 'locally' and 'localized' are returned in addition to 'local', when I only want instances of 'local' returned.
I have tried:
match_test = re.compile(r"\b" + keyword+ r"\b")
match_test = ('\b' + keyword + '\b')
match_test = re.compile('?:^|\s|$){0}'.format(keyword))
matching = [x for x in lines if keyword.lower() == x.lower()]
matching = [x for x in lines if keyword.lower() == x.lower().strip()]
And none of them have worked, so I'm a bit stuck.
How do I take the keyword entered from the user, and then return all strings in a list that contain that exact keyword?
Thanks
in means "contained in": 'abc' in 'abcd' is True. For an exact match, use ==
matching = [x for x in lines if keyword.lower() == x.lower()]
You might need to remove spaces and newlines as well
matching = [x for x in lines if keyword.lower().strip() == x.lower().strip()]
Edit:
To find a line containing the keyword you can use loops
matches = []
for line in lines:
    for string in line.split(' '):
        if string.lower().strip() == keyword.lower().strip():
            matches.append(line)
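The loop above misses words that carry punctuation, such as 'local!' or 'Local.'. A possible variation, sketched below, strips punctuation from each word before comparing. The helper name contains_word and the sample lines are illustrative, and it assumes punctuation only needs removing at the edges of a word.

```python
import string


def contains_word(line, keyword):
    """True if keyword appears in line as a whole word, ignoring case."""
    target = keyword.strip().lower()
    # split() with no argument also handles tabs and repeated spaces
    words = (word.strip(string.punctuation).lower() for word in line.split())
    return any(word == target for word in words)


lines = [
    "locally local! localized",
    "i live locally",
    "the magic word is Local.",
    "nonlocal",
]
matches = [line for line in lines if contains_word(line, "local")]
print(matches)
```

Using any() also means a line is kept once, even when the keyword occurs in it several times.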
This method avoids having to read the whole file into memory. It also deals with cases like "LocaL" or "LOCAL" assuming you want to capture all such variants. There is a bit of performance overhead on making the temp string each time the line is read, however:
import re
def reader(filename, target):
    # this regexp matches a word at the front, end or in the middle of a line stripped
    # of all punctuation and other non-alpha, non-whitespace characters:
    regexp = re.compile(r'(^| )' + target.lower() + r'($| )')
    with open(filename) as fin:
        matching = []
        # read lines one at a time:
        for line in fin:
            line = line.rstrip('\n')
            # generates a line of lowercase and whitespace to test against
            temp = ''.join([x.lower() for x in line if x.isalpha() or x == ' '])
            print(temp)
            if regexp.search(temp):
                matching.append(line) # store unaltered line
    return matching
Given the following tests:
locally local! localized
locally locale nonlocal localized
the magic word is Local.
Localized or nonlocal or LOCAL
This is returned:
['locally local! localized',
'the magic word is Local.',
'Localized or nonlocal or LOCAL']
Here is my solution, which matches only 'local' in the text file below. It uses re.search to find the lines that end in 'local'; lines where 'local' is followed by anything else, such as 'localized', are not matched.
Contents of the text file:
local
localized
locally
local
local diwakar
local
local##!
Code to find only the instances of 'local' in the text file:
import re
with open('C:/path_to_file.txt') as f:
    for line in f:
        a = re.search(r'local\W$', line)
        if a:
            print(line)
Output
local
local
local
Let me know if this is what you were looking for
Your first attempt seems to be on the right track.
Using input:
import re
lines = [
'local student',
'i live locally',
'keyboard localization',
'what if local was in middle',
'end with local',
]
keyword = 'local'
Try this:
pattern = re.compile(r'.*\b{}\b'.format(keyword.lower()))
matching = [x for x in lines if pattern.match(x.lower())]
print(matching)
Output:
['local student', 'what if local was in middle', 'end with local']
pattern.match returns a match object if the regex matches starting at the beginning of the string, or None otherwise. Using this as your if condition filters for strings that contain the whole keyword somewhere. This works because \b matches the beginning or end of a word. The .* consumes any characters that occur at the start of the line before your keyword shows up.
For more info about using Python's re, check out the docs here: https://docs.python.org/3.8/library/re.html
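A slightly more defensive variant of the same word-boundary idea is sketched below. It assumes the keyword comes from user input, so it is passed through re.escape, and it uses re.IGNORECASE with search() instead of lowercasing and a leading .*; the function name lines_with_word is illustrative.

```python
import re


def lines_with_word(lines, keyword):
    """Return the lines that contain keyword as a whole word, ignoring case."""
    # re.escape keeps characters such as '.' or '+' in the keyword literal
    pattern = re.compile(r"\b{}\b".format(re.escape(keyword)), re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]


lines = [
    'local student',
    'i live locally',
    'keyboard localization',
    'what if local was in middle',
    'end with local',
    'The magic word is Local.',
]
result = lines_with_word(lines, 'local')
print(result)
```

Note that \b only behaves as a word boundary when the keyword itself starts and ends with a word character (a letter, digit, or underscore).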
You can try
pattern = re.compile(r"\b{}\b".format(keyword))
match_test = pattern.search(line)
as shown in
Python - Concat two raw strings with an user name
I have the following code that extracts the Message-ID lines and gathers them in a DataFrame. It works and gives me the following results.
This is an example of the lines in the DataFrame:
Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>
What I want is only the string after the < character and before the >, since the Message-ID value ends with >. I also have some lines where the Message-ID value is empty; I want to delete these lines.
Here is the code that I wrote:
import pandas as pd
import numpy as np
f = open('C:\\Users\\hmk\\Desktop\\PFE 2019\\ML\\MachineLearningPhishing-master\\MachineLearningPhishing-master\\code\\resources\\emails-enron.mbox','r')
line_num = 0
e = []
search_phrase = "Message-ID"
for line in f.readlines():
    line_num += 1
    if line.find(search_phrase) >= 0:
        #line = line[13:]
        #line = line[:-2]
        e.append(line)
f.close()
dfObj = pd.DataFrame(e)
One way to do it is using regex and pandas DataFrame.replace:
clean_df = df.replace(to_replace=r'<|>', value='', regex=True)
clean_df = clean_df.replace(to_replace=r'(Message-ID:\s*$)', value=np.nan, regex=True).dropna()
The first line removes the < and > characters, assuming the messages contain only those two.
The second checks whether there is a message ID in the body; if not, the value is replaced with NaN.
Note that I used numpy.nan just to simplify dropping the blank messages.
You can use a regex which will extract the desired Message-ID for you.
So your first part for extracting the message id would be like below:
import re # import regex
s = 'Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>'
message_id = re.search(r'Message-ID: <(.*?)>', s).group(1)
print('message_id: ', message_id)
The resulting Message ID:
>>> message_id:  23272646.1075847145300.JavaMail.evans#thyme
So you can loop through your data and check for the regex like this:
for line in f.readlines():
    line_num += 1
    message_id = re.search(r'Message-ID: <(.*?)>', line)
    if message_id:
        msg_id_string = message_id.group(1)
        e.append(msg_id_string)
        # your other work
The if message_id: check tests whether the regex matched. When there is no match, re.search returns None and the body of the if is skipped.
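Put together, the loop could look like the sketch below, which works on any iterable of lines (an open file included) and skips lines whose Message-ID is empty. The function name extract_message_ids and the sample lines are made up for illustration.

```python
import re

# [^<>]+ requires at least one character between the brackets,
# so an empty "<>" or a bare "Message-ID:" line is skipped
MESSAGE_ID = re.compile(r"Message-ID:\s*<([^<>]+)>")


def extract_message_ids(lines):
    """Collect the text between < and > from every Message-ID line."""
    ids = []
    for line in lines:
        match = MESSAGE_ID.search(line)
        if match:
            ids.append(match.group(1))
    return ids


sample = [
    "Message-ID: <23272646.1075847145300.JavaMail.evans#thyme>\n",
    "Message-ID: \n",
    "Subject: hello\n",
    "Message-ID: <>\n",
]
ids = extract_message_ids(sample)
print(ids)
```

The resulting list of plain strings can then be handed to pd.DataFrame in the same way as the list e in the question.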
You want a substring of your lines:
for line in f.readlines():
    if all(word in line for word in [search_phrase, "<", ">"]):
        e.append(line[line.find("<")+1:line.find(">")])
        # slicing up to find(">") avoids assuming ">" is the last character
Use in to check if a string is inside another string
Use find to get the index of your pattern
Use [in:out] to get substring between your two values
s = "We want <This text inside only>. yes we do."
s2 = s[s.find("<")+1:s.find(">")]
print(s2) # Prints : This text inside only
# If you want to remove empty lines :
lines = filter(lambda x: x.strip(), lines)
filter goes through all the lines, so no explicit for loop is needed.
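One caveat with slicing on find() is that it returns -1 when the character is missing, which silently yields a wrong slice. A sketch using str.partition avoids that; the helper name between is hypothetical.

```python
def between(text, start="<", end=">"):
    """Return the text between the first start and the following end, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # no opening delimiter
    inner, found_end, _ = rest.partition(end)
    if not found_end:
        return None  # no closing delimiter after the opening one
    return inner


s = "We want <This text inside only>. yes we do."
print(between(s))
```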
One suggestion for you:
import re
f = open('PATH/TO/FILE', 'r').read()
msgID = re.findall(r'(?<=<).*?(?=>)', f)
I have to strip whitespace from extracted strings, one string at a time, for which I'm using split(). The split() function returns a list after removing the whitespace. I want to store this in my own dynamic list, since I have to aggregate all of the strings.
The snippet of my code:
while rec_id == "ffff":
    output = procs.run_cmd("get sensor info", command)
    sdr_li = []
    if output:
        byte_str = output[0]
        str_1 = byte_str.split(' ')
        for byte in str_1:
            sdr_li.append(byte)
    rec_id = get_rec_id()
Output = ['23 0a 06 01 52 2D 12']
str_1 = ['23','0a','06','01','52','2D','12']
This does not look very elegant, transferring from one list to another. Is there another way to achieve this?
list.extend():
sdr_li.extend(str_1)
str.split() returns a list, so just add its items to the main list with extend: https://docs.python.org/2/tutorial/datastructures.html
So, rewriting your code into something legible and properly indented, you'd get:
my_list = []
while rec_id == "ffff":
    output = procs.run_cmd("get sensor info", command)
    if output:
        result_string = output[0]
        # extend my_list with the list resulting from the whitespace
        # separated tokens of the output
        my_list.extend( result_string.split() )
        pass # end if
    rec_id = get_rec_id()
    ...
    pass # end while
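Since procs.run_cmd and get_rec_id are not available outside the asker's project, the sketch below simulates their results with a plain list of outputs to show extend() aggregating tokens across iterations. The function name collect_tokens and the sample data are assumptions for illustration only.

```python
def collect_tokens(outputs):
    """Aggregate the whitespace-separated tokens of each command output."""
    collected = []  # created once, outside the loop, so earlier results are kept
    for output in outputs:
        if output:
            # extend() adds each token individually; append() would nest the list
            collected.extend(output[0].split())
    return collected


# Simulated command results: each is a list whose first item is a byte string
outputs = [['23 0a 06 01 52 2D 12'], [], ['ff  00']]
tokens = collect_tokens(outputs)
print(tokens)
```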