I have a TSV file which is prepared like this:
*Settings*
Force, Tags FakeTag
Resource ../../robot_resources/global.tsv
*Test, Cases*
Testcase1 [Documentation] PURPOSE:
... Test, checking,,
...
...
... PRECONDITION:
... set,
... make, sure, that,
...
...
[Setup] stopProcessAndWaitForIt
startAdvErrorTrace
startAdvDumpEvents
stopProcessAndWaitForIt
stopAndDownloadAdvDumpEvents
Testcase2 [Documentation] PURPOSE:
... Test, checking,, if,
...
...
... PRECONDITION:
What I want to do is:
- start reading the file from Test, Cases
- read every testcase separately: testcase1, testcase2, ..., n (every testcase starts without a tab, the testcase body starts with a tab)
- evaluate whether all testcases have the expressions "startAdvErrorTrace", "startAdvDumpEvents", etc.
The TSV has about 50 testcases and I want to evaluate the whole file.
I'm totally green at programming. I have found some ideas, like reading the TSV file with the csv module, but I don't know how to achieve what I want.
I don't know what file format this is, but you can do something like this:
import re

items = ("startAdvErrorTrace", "startAdvDumpEvents")  # add more as desired

with open("testfile") as infile:
    # skip contents until *Test, Cases*
    match = re.search(r"(?s)\*Test, Cases\*.*", infile.read())
    cases = match.group(0).split("\n\n")  # split on two consecutive newlines
    for case in cases:
        if not all(item in case for item in items):
            print("Not all items found in case:")
            print(case)
Here's a little script to parse the flags per Testcase. The output is:
Flags per testcase:
{1: ['startAdvErrorTrace', 'startAdvDumpEvents'], 2: []}
And the script is:
import re

usage = {}
flags = ('startAdvErrorTrace', 'startAdvDumpEvents')

with open("testfile") as f:
    lines = [line.strip() for line in f.readlines()]

start_parsing = False
for line in lines:
    if 'Test, Cases' in line:
        start_parsing = True
        continue
    if start_parsing:
        if 'Testcase' in line:
            case_nr = int(re.findall(r'\d+', line)[0])
            usage[case_nr] = []
            continue
        for flag in flags:
            if flag in line:
                usage[case_nr].append(flag)

print('Flags per testcase:')
print(usage)
Hope that helps.
I have the following code that takes raw_file.txt and turns it into processed_file.txt.
Problem 1:
Besides item_location, I also need the item_id (as str, not int) to be in the processed file, perhaps as a list, so it would look like WANTED_processed_file.txt.
def process_file(raw_file, processed_file, target1):
    with open(raw_file, 'r') as raw_file:
        with open(processed_file, 'a') as processed_file:
            for line in raw_file:
                if target1 in line:
                    processed_file.write(line.split(target1)[1])

process_file('raw_file.txt', 'processed_file.txt', 'item_location: ')
By adding another if statement with target2, the content is appended below target1 (as expected), but I don't actually know how to make it a list.
Problem 2:
With my current code I'm only able to process the string corresponding to the line, but since WANTED_processed_file.txt contains multiple lists, I need to adapt it.
def my_function():
    print(i)

with open('processed_file.txt', "r") as processed_file:
    items = processed_file.read().splitlines()
    for i in items:
        my_function()
This is what I've tried but I'm not getting the desired results:
def my_function():
    print(f'Item {i[0]} is located at {i[1]}')

with open('WANTED_processed_file.txt', "r") as processed_file:
    items = processed_file.read()
    for i in items:
        my_function()
raw_file.txt:
ITEM:
item_id: 0001
item_location: first location
item_description: something
ITEM:
item_id: 0002
item_location: second location
item_description: something else
processed_file.txt:
first location
second location
WANTED_processed_file.txt:
['0001', 'first location']
['0002', 'second location']
Thank you and apologies for the lengthy post
You want to parse multiline blocks from a text file. The robust way would be to process it line by line, searching for start-of-block markers and data markers, fill a data structure, and then store the data to a file.
If you are sure that your items will always be in the same order, you could use a multiline regex:
import re
...
with open('raw_file.txt') as fd:
    text = fd.read()

rx = re.compile(r'ITEM:.*?item_id: ([^\n]*).*?item_location: ([^\n]*)',
                re.MULTILINE | re.DOTALL)

with open('processed_file', 'w') as fd:
    for record in rx.finditer(text):
        print(list(record.groups()), file=fd)
But beware, it will be less robust than a true parser...
I have a script that reads through a log file containing hundreds of these logs and looks for the ones that have an "On", "Off", or "Switch" type. Then I output each log into its own list. I'm trying to find a way to extract the Out and In times into a separate list/array and then subtract the two times to find the duration of each separate log. This is what the outputted logs look like:
['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
This is my current code:
logfile = '/path/to/my/logfile'

with open(logfile, 'r') as f:
    text = f.read()

words = ["On", "Off", "Switch"]
text2 = text.split('\n')

for l in text.split('\n'):
    if (words[0] in l or words[1] in l or words[2] in l):
        log = l.split(',')[0:3]
I'm stuck on how to target only the Out and In time values from the logs and put them in an array and convert to a time value to find duration.
Initial log before the script (everything after the "In" time is useless for what I'm looking for, so I only output the first three indices):
2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a","Type":"Switch,"In":"2020-01-31T00:30:20.140Z","Path":"interface","message":"interface changed status from unknown to normal","severity":"INFORMATIONAL","display":true,"json_map":"{\"severity\":null,\"eventId\":\"65e-64d9-45-ab62-8ef98ac5e60d\",\"componentPath\":\"interface_css\",\"displayToGui\":false,\"originalState\":\"unknown\",\"closed\":false,\"eventType\":\"InterfaceStateChange\",\"time\":\"2019-04-18T07:04:32.747Z\",\"json_map\":null,\"message\":\"interface_css changed status from unknown to normal\",\"newState\":\"normal\",\"info\":\"Event created with current status\"}","closed":false,"info":"Event created with current status","originalState":"unknown","newState":"normal"}
Below is a possible solution. The wordmatch line is a bit of a hack, until I find something clearer: it's just a one-liner that creates an empty or one-element set of True if one of the words matches.
(Untested)
import re

logfile = '/path/to/my/logfile'

words = ["On", "Off", "Switch"]
dateformat = r'\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[Zz]?'
pattern = fr'Out:\s*\[(?P<out>{dateformat})\].*In":\s*\"(?P<in>{dateformat})\"'
regex = re.compile(pattern)

with open(logfile, 'r') as f:
    for line in f:
        wordmatch = set(filter(None, (word in line for word in words)))
        if wordmatch:
            match = regex.search(line)
            if match:
                intime = match.group('in')
                outtime = match.group('out')
                # whatever to store these strings, e.g., append to a list or insert into a dict
As noted, your log example is very awkward, so this works for the example line, but may not work for every line. Adjust as necessary.
I have also not included (if wanted) a conversion to a datetime.datetime object. For that, read through the datetime module documentation, in particular datetime.strptime. (Alternatively, you may want to store your results in a Pandas table. In that case, read through the Pandas documentation on how to convert strings to actual datetime objects.)
You also don't need to read and split on newlines yourself: for line in f will do that for you (provided f is indeed a filehandle).
Regex is probably the way to go (speed, efficiency, etc.) ... but ...
You could take a very simplistic (if very inefficient) approach of cleaning your data:
join all of it into a string
replace things that hinder easy parsing
split wisely and filter the split
like so:
data = ['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']

all_text = " ".join(data)

# this is inefficient and will create throwaway intermediate strings - if you are
# in a hurry or operate on 100s of MB of data, this is NOT the way to go, unless
# you have time

# iterate pairs of ("bad thing", "what to replace it with") (or list of bad things)
for thing in [ (": ", ":"), (list('[]{}"'), "") ]:
    whatt = thing[0]
    withh = thing[1]
    # if list, do so for each bad thing
    if isinstance(whatt, list):
        for p in whatt:
            # replace it
            all_text = all_text.replace(p, withh)
    else:
        all_text = all_text.replace(whatt, withh)

# format is now far better suited to splitting/filtering
cleaned = [a for a in all_text.split(" ")
           if any(a.startswith(prefix) or "Switch" in a
                  for prefix in {"In:", "Switch:", "Out:"})]

print(cleaned)
Outputs:
['Out:2020-01-31T00:30:20.150Z', 'Type:Switch', 'In:2020-01-31T00:30:20.140Z']
After cleaning your data would look like:
2020-01-31T12:04:57.976Z 1234 Out:2020-01-31T00:30:20.150Z Id:Id:4-f-4-9-6a Type:Switch In:2020-01-31T00:30:20.140Z
You can transform the clean list into a dictionary for ease of lookup:
d = dict( part.split(":",1) for part in cleaned)
print(d)
will produce:
{'In': '2020-01-31T00:30:20.140Z',
'Type': 'Switch',
'Out': '2020-01-31T00:30:20.150Z'}
You can use the datetime module to parse the times from your values, as shown in 0 0's post.
I have a large CSV file and I want to compare its URLs against the domains in my txt file.
How do I get unique results when the same name has multiple values in Python, and is there a faster way to compare the two files? The CSV file is at least 1 GB.
file1.csv
[01/Nov/2019:09:54:26 +0900] ","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","164.16.37.75","52.222.194.116","200","CONNECT","http://www.google.com:443","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","197.99.94.32","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","157.87.34.72","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
file2.txt
1 www.amazon.com shop
1 wakers.com shop
script:
import csv

with open("file1.csv", 'r') as f:
    reader = csv.reader(f)
    for k in reader:
        ko = set()
        srcip = k[2]
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        ko.add((war, srcip))

        for to in ko:
            with open("file2.txt", "r") as f:
                all_val = set()
                for i in f:
                    val = i.strip().split(" ")[1]
                    if val in to[0]:
                        all_val.add(to)
                for ki in all_val:
                    print(ki)
my output:
('www.amazon.com', '102.12.14.22')
('www.amazon.com', '167.27.14.62')
('www.wakers.com', '192.10.77.95')
('www.amazon.com', '167.27.14.62')
('www.amazon.com', '197.99.94.32')
('www.amazon.com', '157.87.34.72')
If the URL is the same, how do I get all of its values grouped together, with each value unique?
How do I get results like this?
amazon.com 102.12.14.22
167.27.14.62
197.99.94.32
157.87.34.72
wakers.com 192.10.77.95
Short answer: you can't do this directly with good performance; well, you can, but performance will be poor.
CSV is a good storage format, but if you want to do something like this you might want to store everything in another, custom data file. You could first parse your file so it holds only unique IDs instead of long strings (like amazon = 0, wakers = 1, and so on) to perform better and reduce the comparison cost.
The thing is, plain CSV files are pretty bad for this kind of variable lookup; memory mapping or building a database from your CSV might also be a good option (make the changes in the database and only dump the CSV when you need it).
Look at How do quickly search through a .csv file in Python for a more complete answer.
Problem solution
import csv
import re

def possible_urls(filename, category, category_position, url_position):
    # Here we read a txt file to create a list of domains that could correspond to shops
    domains = []
    with open(filename, "r") as file:
        file_content = file.read().splitlines()
    for line in file_content:
        info_in_line = line.split(" ")
        # Here I use a regular expression to parse the domain from the url.
        domain = re.sub('www.', '', info_in_line[url_position])
        if info_in_line[category_position] == category:
            domains.append(domain)
    return domains

def read_from_csv(filename, ip_position, url_position, possible_domains):
    # Here we create a dictionary where the keys are domains and the
    # values are all the ips that each domain can have.
    # The dictionary will look like this:
    # {domain_name: [list of possible ips]}
    domain_ip = {domain: [] for domain in possible_domains}
    with open(filename, 'r') as f:
        reader = csv.reader(f)
        for line in reader:
            if len(line) < max(ip_position, url_position):
                print(f'Not enough items in line {line} to obtain url or ip')
                continue
            ip = line[ip_position]
            url = line[url_position]
            # Using a python regular expression to get a domain name
            # from the url.
            domain = re.search(r'//[w]?[w]?[w]?\.?(.[^/]*)[:|/]', url).group(1)
            if domain in domain_ip.keys():
                domain_ip[domain].append(ip)
    return domain_ip

def print_formatted_result(result):
    # Prints the formatted result
    for shop_domain in result.keys():
        print(f'{shop_domain}: ')
        for shop_ip in result[shop_domain]:
            print(f'    {shop_ip}')

def create_list_of_shops():
    # Function that first creates a list of possible domains, and
    # then reads the ips for those domains from the csv
    possible_domains = possible_urls('file2.txt', 'shop', 2, 1)
    shop_domains_with_ip = read_from_csv('file1.csv', 2, 6, possible_domains)
    # Display the result we got in the previous operations
    print(shop_domains_with_ip)
    print_formatted_result(shop_domains_with_ip)

create_list_of_shops()
Output
A dictionary of IPs where domains are the keys, so you can get all possible IPs for a domain by giving the name of that domain:
{'amazon.com': ['102.12.14.22', '167.27.14.62', '167.27.14.62', '197.99.94.32', '157.87.34.72'], 'wakers.com': ['192.10.77.95']}
amazon.com:
102.12.14.22
167.27.14.62
167.27.14.62
197.99.94.32
157.87.34.72
wakers.com:
192.10.77.95
Regular expressions
A very useful thing you can learn from this solution is regular expressions. Regular expressions are tools that allow you to filter or retrieve information from lines in a very convenient way. They also greatly reduce the amount of code, which makes the code more readable and safer.
Let's consider your code for removing ports from strings and think about how we can replace it with a regex.
lines = url.replace(":443", "").replace(":8080", "")
Replacing ports this way is fragile, because you can never be sure which port numbers will actually appear in a URL. What if port number 5460 appears, or port number 1022, etc.? For each such port you would add a new replace, and soon your code would look something like this:
lines = url.replace(":443", "").replace(":8080", "").replace(":5460","").replace(":1022","")...
Not very readable. But with a regular expression you can describe a pattern. And the great news is that we actually know the pattern for a URL with a port number. They all look like this:
:some_digits. So, knowing the pattern, we can describe it with a regular expression and tell Python to find everything that matches it and replace it with the empty string '':
re.sub(':\d+', '', url)
It tells the Python regular expression engine:
Look for all digits in the string url that come after a : and replace them with the empty string. This solution is shorter, safer and much more readable than the chain of replace calls, so I suggest you read about regular expressions a little. A great resource for learning about regular expressions is
this site. There you can also test your regex.
Explanation of Regular expressions in code
re.sub('www.', '', info_in_line[url_position])
Look for all occurrences of www. in the string info_in_line[url_position] and replace them with the empty string.
re.search('www.(.[^/]*)[:|/]', url).group(1)
Let's split it into parts:
[^/] - here can be anything except /
(.[^/]*) - here I used a match group. It tells the engine which part of the match we are interested in.
[:|/] - these are the characters that can appear in that position. Long story short: after the capturing group there can be a : or (|) a /.
So, summarizing, the regex can be expressed in words as follows:
Find all substrings that start with www. and end with : or /, and return everything that stands between them.
group(1) - means get the first capturing group.
Hope the answer is helpful!
If you used the URL as the key in a dictionary, and had your IP address sets as the elements of the dictionary, would that achieve what you intended?
my_dict = {
    'www.amazon.com': {
        '102.12.14.22',
        '167.27.14.62',
        '197.99.94.32',
        '157.87.34.72',
    },
    'www.wakers.com': {'192.10.77.95'},
}
## I have used your code to get your desired output
## Copy-paste the code & execute it to get the result
import csv

url_dict = {}

## STEP 1: Open file2.txt to get url names
with open("file2.txt", "r") as f:
    for i in f:
        val = i.strip().split(" ")[1]
        url_dict[val] = []

## STEP 2: 2.1 Open csv file 'file1.csv' to extract url name & ip address
##         2.2 Check if url from file2.txt is available from the extracted url from 'file1.csv'
##         2.3 Create a dictionary with the matched url & its ip address
##         2.4 Remove duplicates in ip addresses from same url
with open("file1.csv", 'r') as f:                       ## 2.1
    reader = csv.reader(f)
    for k in reader:
        #ko = set()
        srcip = k[2]
        #print(srcip)
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        for key, value in url_dict.items():
            if key in war:                               ## 2.2
                url_dict[key].append(srcip)              ## 2.3

## 2.4
for key, value in url_dict.items():
    url_dict[key] = list(set(value))

## STEP 3: Print dictionary output to .TXT file
file3 = open('output_text.txt', 'w')
for key, value in url_dict.items():
    file3.write('\n' + key + '\n')
    for item in value:
        file3.write(' '*15 + item + '\n')
file3.close()
Similar questions have been asked but none quite like this.
I need to save 2 pieces of information in a text file, the username and their associated health integer. Now I need to be able to look into the file and see the user and then see what value is connected with it. Writing it the first time I plan to use open('text.txt', 'a') to append the new user and integers to the end of the txt file.
My main problem is this: how do I figure out which value is connected to a user string? If they're on the same line, can I do something like reading only the number in that line?
What are your guys' suggestions? If none of this works, I guess I'll need to move over to json.
This may be what you're looking for. I'd suggest reading one line at a time to parse through the text file.
Another method would be to read the entire txt and separate strings using something like text_data.split("\n"), which should work if the data is separated by line (denoted by '\n').
You're probably looking for configparser which is designed for just that!
Construct a new configuration
>>> import configparser
>>> config = configparser.ConfigParser()
>>> config.sections()
[]
>>> config['Players'] = {
... "ti7": 999,
... "example": 50
... }
>>> with open('example.cfg', 'w') as fh:
... config.write(fh) # write directly to file handler
...
Now read it back
>>> import configparser
>>> config = configparser.ConfigParser()
>>> config.read("example.cfg")
['example.cfg']
>>> print(dict(config["Players"]))
{'ti7': '999', 'example': '50'}
Inspecting the written file
% cat example.cfg
[Players]
ti7 = 999
example = 50
If you already have a text config written in the form key value in each line, you can probably parse your config file as follows:
user_healths = {}                              # start empty dictionary
with open("text.txt", 'r') as fh:              # open file for reading
    for line in fh.read().strip().split('\n'): # list lines, ignore last empty
        user, health = line.split(maxsplit=1)  # "a b c" -> ["a", "b c"]
        user_healths[user] = int(health)       # ValueError if not number
Note that this will make the user's health the last value listed in text.txt if it appears multiple times, which may be what you want if you always append to the file
% cat text.txt
user1 100
user2 150
user1 200
Parsing text.txt above:
>>> print(user_healths)
{'user1': 200, 'user2': 150}
The problem statement is as follows: there is a log file containing logs related to testing results. For example, it contains text like testcase1 followed by the logs for that test case, testcase2 followed by the logs for that test case, and so on.
If the user wants to extract the logs for testcase1 and testcase3, the script should read input from the user, like testcase1 and testcase3, and then extract only the logs for the specified testcases.
In this case, assuming the user enters testcase1 and testcase3, the output should be the lines below testcase1 and testcase3 in the log file.
Here you have to go with the assumption that all the testcase logs are on separate lines, which means there is a '\n' after every log line.
Then you can read the file with the linecache module.
Now, again, you should have your logs in a particular format. Here you have mentioned it as [testcaseN] [Log message], where [testcaseN] has 'testcase' in common and 'N' as the variable part.
So while you fetch all lines using the linecache module, use the re module to match the testcaseN given as input against the first word of each fetched line. Once you get a match, display the result.
Finally got the working script to extract the text:
# Read particular lines from the text file
# For example, in this program we read the file and print only the lines under Title 2 and title 4
# The log file may contain empty lines as well
# Example log file
# Title
# 1
# dklfjsdkl;
# g
# sdfzsdfsdf
# sdfsdfsdf
# dsfsdfsd
# dfsdf
#
# title
# 2
#
# dfdf
# dfdf
# dfdf
# df
# dfd
# d
#
# title3
# sdfdfd
# dfdfd
# dfd
#
# dfd
#
# title
# 4
# dfkdfkd
# dfdkjmd
# dfdkljm
in_list = []
while True:
    i = input("Enter title to be extracted (or Enter to quit): ")
    in_list.append(i)
    if not i:
        break
    print("Your input:", i)
print("While loop has exited")
in_list.remove(i)                      # drop the final empty entry
print("Input list", in_list)

flist = []
with open("C:\\text.txt", 'r') as inp:
    # read the file and store it into a list
    flist = inp.readlines()

# making everything in the list lower case
flist = [s.lower() for s in flist]
flist = [s.strip("\n") for s in flist]
print(flist)

# printing the complete log file from the list: once we put the value in the list,
# a newline character is appended to the list element, hence stripping the \n character
# for i in flist:
#     print(i.strip("\n"))

for j in range(len(in_list)):
    result = any(in_list[j] in word for word in flist)
    if result:
        # index of the first line that contains the requested title
        i_index = next(idx for idx, word in enumerate(flist) if in_list[j] in word)
        flag = 0
        with open("C:\\output.txt", 'a') as f1:
            f1.write(flist[i_index])
            f1.write("\n")
            while flag == 0:
                if i_index + 1 >= len(flist) or "title" in flist[i_index + 1]:
                    flag = 1
                else:
                    i_index += 1
                    f1.write(flist[i_index])
                    f1.write("\n")