Unable to create a dictionary - python

I am trying to write a script for log parsing.
I have a file in which the logs are jumbled up. The first line of each log entry has a timestamp, so I want to sort the entries by it.
For example:
10:48 Start
.
.
10:50 start
.
.
10:42 start
The first line of each set contains the keyword ‘Start’ and the timestamp. The lines from a ‘Start’ line up to the next ‘start’ line form one set. I want to sort all of these sets in the log file by their timestamps.
Code Logic:
I thought of creating a dictionary, using the timestamp as the ‘key’ and the text of that log set as the value. Then I would sort the ‘keys’ and write their ‘values’ to a file in that sorted order.
However, I am getting the error “TypeError: unhashable type: 'list'”
write1 = False
x = 0
search3 = "start"
matched = dict()
matched = {}
# fo is a list which is defined elsewhere in the code.
for line in fo:
    if search3 in line:
        # got the Hello2 printed which indicates script enters this loop
        print('hello2')
        write1 = True
        x += 1
        time = line.split()[3]
        name1 = [time.split(':')[0] +":"+time.split(':')[1] + ":"+ time.split(':')[2]]
        matched[name1] = line
    elif write1:
        matched[name1] += line
print(matched.keys())
Please let me know whether my logic and approach are correct.

You set name1 as a list. Lists aren't hashable, so they can't be used as dictionary keys; immutable values such as strings, numbers, and tuples of hashable items can. I assume that you want name1 to be a string, so you just need to remove the brackets:
name1 = time.split(':')[0] +":"+time.split(':')[1] + ":"+ time.split(':')[2]
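With string keys in place, the sorting step described in the question could be sketched roughly as follows. This is only an illustration, not the asker's code: the function name group_and_sort and its time_field parameter are hypothetical, and it assumes that timestamps are zero-padded (HH:MM or HH:MM:SS) so that plain string sorting matches time order.

```python
def group_and_sort(lines, keyword="start", time_field=0):
    """Group lines into sets that begin at a 'start' line, sorted by timestamp.

    time_field is the index of the whitespace-separated token that holds the
    timestamp (an assumption; the question's own code uses index 3).
    """
    blocks = {}
    current = None  # timestamp of the set currently being collected
    for line in lines:
        if keyword in line.lower():
            current = line.split()[time_field]
            # Append rather than assign so two sets with the same timestamp
            # do not overwrite each other.
            blocks[current] = blocks.get(current, "") + line
        elif current is not None:
            blocks[current] += line
        # Lines before the first start line are ignored.
    return [blocks[key] for key in sorted(blocks)]


log = [
    "10:48 Start\n", "first set\n",
    "10:50 start\n", "second set\n",
    "10:42 start\n", "third set\n",
]
for block in group_and_sort(log):
    print(block, end="")
```

If the timestamps are not zero-padded, converting them with datetime.strptime before sorting would be the safer choice.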

Related

How to convert a text file where data is separated by rows using ids, to an easy lookup (dictionary, json, etc.) preserving whitespacing/lines?

I have a text file that looks like this
start_id=372
text: this is some text with
vartions in
white spacing and such
END_OF_RECORD
start_id=3453
text: Continued for
this record
that has other
variations in whitespacing
END_OF_RECORD
I need to convert this such that I can easily access the data with the preserved whitespacing and lines.
So something like this
result = function('start_id=3453')
result
returns
text: Continued for
this record
that has other
variations in whitespacing
The reason I need to preserve the whitespace is that I need to look things up by span. So
result[11:14]
results in
Con
Strategies I have thought of:
I have an algorithm that goes down the lines and checks whether each line starts with 'start_id'. When one does, I read along the line until I reach the end of the line or whitespace, and record this span as a dictionary key.
Then I go down the lines until I hit 'END_OF_RECORD'.
I then somehow copy the whole span of lines into the dictionary value for that key.
My concerns about this method are whether there are any edge cases I am not thinking of, and how to copy several whole lines into a single Python value.
That should be actually quite simple if you use regex... something like this:
import re
dictionary_of_records = {}  # records will be stored here
recording = False  # this will allow me to prevent starting new record while recording and will hold the id
# matchers
start = re.compile(r'start_id=(\d*)')
end = re.compile(r'END_OF_RECORD')
with open('stack.txt') as file:
    for line in file.readlines():
        if start.match(line):
            if recording:
                raise Exception('Attempting to create new record without ending previous one!')
            print('Start... matched!')
            recording = start.search(line).group(1)  # save the id of recording
            print(f'Starting record with id {recording}')
            current_record_string = ''  # make an empty string to save recording to
        elif end.match(line):
            print('End... matched!')
            dictionary_of_records[recording] = current_record_string  # save the entry to dict
            recording = False  # reset recording to False
            continue
        elif not recording:
            continue
        else:
            current_record_string += line
print(dictionary_of_records)
IIUC:
data = {}
id_ = None  # no record is open until the first start marker is seen
lines = []
for line in open('/content/deidentified-medical-text-1.0/id.text').readlines():
    if line.startswith('START_OF_RECORD='):
        id_ = line.strip().split('=')[1]
        lines = []
    elif line.startswith('||||END_OF_RECORD'):
        data[id_] = ''.join(lines)
        id_ = None
        lines = []
    elif id_:
        lines.append(line)
>>> data
{'372': 'text: this is some text with\nvartions in\n\n\n white spacing and such\n',
'3453': 'text: Continued for\n\nthis record\n that has other\n\n\nvariations in whitespacing\n'}
>>> data['3453'][11:14]
'Con'
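For comparison, the same lookup table can be built in a single pass over the whole text with one regular expression. This is a sketch under assumptions: it uses the 'start_id=' and 'END_OF_RECORD' markers from the question's sample, the names RECORD_RE and parse_records are made up for illustration, and the sample text is invented to show that whitespace survives.

```python
import re

# Assumed markers, taken from the sample in the question: each record starts
# with a "start_id=<digits>" line and ends with an "END_OF_RECORD" line.
RECORD_RE = re.compile(
    r"^start_id=(\d+)[^\n]*\n(.*?)^END_OF_RECORD",
    re.DOTALL | re.MULTILINE,
)


def parse_records(text):
    """Return {id string: record body}, with whitespace and newlines untouched."""
    return {rec_id: body for rec_id, body in RECORD_RE.findall(text)}


sample = (
    "start_id=372\n"
    "text:   some text\n"
    "\n"
    "  more\n"
    "END_OF_RECORD\n"
    "start_id=3453\n"
    "text:     Continued for\n"
    "END_OF_RECORD\n"
)
records = parse_records(sample)
print(repr(records["3453"][10:13]))
```

Keys are strings here, so a lookup looks like records["3453"]; wrap the id in int() inside the comprehension if integer keys are preferred.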

How to find the title of a file that sits in between title tags

I have some files that contain a "TITLE..." line followed directly by a "JOURNAL..." line. The line numbers vary from file to file. I am trying to pull all of the text that sits between "...TITLE..." and "...JOURNAL...". So far I can only pull the line that contains "TITLE", but in some files the title spills onto the next line.
I deduced that I must use a=line.find("TITLE") and b=line.find("JOURNAL"),
then loop with for i in range(a,b):. That displays all of the index values from 698 to 768, but only the numbers rather than the characters. How do I display the string? And how do I then clean it up so that it does not include "TITLE", "JOURNAL", or the whitespace between those two and the text I need? Thanks!
This is the version that displays the single line that "TITLE" appears on:
def extract_title():
    f=open("GenBank1.gb","r")
    line=f.readline()
    while line:
        line=f.readline()
        if "TITLE" in line:
            line.strip("TITLE ")
            print(line)
    f.close()
extract_title()
This is the current block, which displays all of those numbers in increasing order on separate lines:
def extract_title():
    f=open("GenBank1.gb","r")
    line=f.read()
    a=line.find("TITLE")
    b=line.find("JOURNAL")
    line.strip()
    f.close()
    if "TITLE" in line and "JOURNAL" in line:
        for i in range(a,b):
            print(i)
extract_title()
Currently, I have from 698-768 displayed like:
698
699
700
etc...
I want to first get them like, 698 699 700,
then convert them to their string value
then I want to understand how to strip the white spaces and the "TITLE" and "JOURNAL" values. Thanks!
I am not sure I understand what you want to achieve, but if I read it correctly you have a string similar to "TITLE 659 JOURNAL" and want the value in the middle? If so, you could use slice notation like this:
line = f.read()
a = line.find("TITLE") + 5 # Because find gives index of the start so we add length
b = line.find("JOURNAL")
value = line[a:b]
value = value.strip() # Strip whitespace
If we now were to return value or print it out we get:
'659'
Similarly, if you want to get the value after JOURNAL you could use slice notation again:
idx = line.find("JOURNAL") + 7
value = line[idx:] # Start after JOURNAL till end of string
You don't need the loop. Just use slicing:
line = 'fooTITLEspamJOURNAL'
start = line.find('TITLE') + 5 # 5 is len('TITLE')
end = line.find('JOURNAL')
print(line[start:end])
output
spam
Another option is to split:
print(line.split('TITLE')[1].split('JOURNAL')[0])
str.split() returns a list, so we use indexes to get the element we want.
In slow motion:
part2 = line.split('TITLE')[1]
title = part2.split('JOURNAL')[0]
print(title)
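For a title that spills over several lines, the same slicing idea can be combined with a whitespace clean-up. The sketch below is an illustration only: the function name extract_between and the sample record are made up, and it assumes the first "TITLE" and the first "JOURNAL" after it are the markers of interest.

```python
def extract_between(text, start_marker="TITLE", end_marker="JOURNAL"):
    """Return the text between the two markers with runs of whitespace collapsed.

    Returns None if either marker is missing (str.find gives -1 in that case).
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)          # skip past the marker itself
    end = text.find(end_marker, start)  # only look after the start marker
    if end == -1:
        return None
    # split() with no arguments breaks on any whitespace, including newlines,
    # so joining with single spaces flattens a multi-line title.
    return " ".join(text[start:end].split())


record = (
    "  AUTHORS   Doe,J.\n"
    "  TITLE     A long title that spills\n"
    "            onto the next line\n"
    "  JOURNAL   Unpublished\n"
)
print(extract_between(record))
```

Reading the whole file with f.read(), as in the second attempt in the question, is what makes the multi-line case work; a line-by-line loop only ever sees one line of the title at a time.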

Python help. Finding largest value in a file and printing out value w name

I need to create a program that opens a file, reads the values inside it, and then prints out the name with the largest value.
The file contains the following info:
Juan,27
Joe,16
Mike,29
Roy,10
Now the code I have is as follows:
UserFile=input('enter file name')
FileOpen=open(UserFile,'r')
LHRS = 0
for line in FileOpen:
    data=line.split(",")
    name=data[0]
    hrs=data[1]
    hrs=int(hrs)
    if hrs > LHRS:
        LHRS = hrs
        if LHRS == LHRS:
            print('Person with largest hours is',name)
This prints out:
Person with the largest hours is Juan
Person with the largest hours is Mike
How can I make it so it only prints out the true largest?
Your effort is pretty impressive for a first-timer. What your code doesn't do is keep track of the name WHILE keeping track of the max value! I'm sure it can be done your way, but might I suggest an alternative?
import operator
Let's read in the file the way I've done below. This is good practice: the with statement closes the file for you, and files that are not closed properly can cause many problems.
with open('/Users/abhishekbabuji/Desktop/example.txt', 'r') as fh:
    lines = fh.readlines()
Now I have each line in a list called lines, but each one still has an annoying \n at the end. Let's replace that with the empty string '':
lines = [line.replace("\n", "") for line in lines]
Now we have a list like this: ['Name1, Value1', 'Name2, Value2', ...]. What I intend to do now is, for each string in the list, use the first part as a key and the integer form of the second part as the value in a dictionary called example_dict. So for 'Name1, Value1', splitting on the comma gives a list in which Name1 is the item at index 0 and Value1 is the item at index 1; the loop below adds each key, value pair to the dictionary.
example_dict = {}
for text in lines:
    example_dict[text.split(",")[0]] = int(text.split(",")[1])
print(example_dict)
Gives:
{'Juan': 27, 'Joe': 16, 'Mike': 29, 'Roy': 10}
Now, obtain the key whose value is max and print it.
largest_hour = max(example_dict.items(), key=operator.itemgetter(1))[1]
highest_key = []
for person, hours in example_dict.items():
    if hours == largest_hour:
        highest_key.append((person, hours))
for pair in highest_key:
    print('Person with largest hours is:', pair[0])
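Staying closer to the loop in the question, the name can be remembered alongside the running maximum and printed once after the loop has finished. This is a sketch, not the asker's program: the function name person_with_most_hours is hypothetical, and it takes a list of "name,hours" lines instead of prompting for a file name. Unlike the dictionary version above, it reports only the first person if several share the maximum.

```python
def person_with_most_hours(lines):
    """Return (name, hours) for the largest hours value, or (None, None) if empty."""
    best_name, best_hours = None, None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at the end of the file
        name, hours_text = line.split(",")
        hours = int(hours_text)
        # Update the name and the maximum together so they never drift apart.
        if best_hours is None or hours > best_hours:
            best_name, best_hours = name, hours
    return best_name, best_hours


sample = ["Juan,27\n", "Joe,16\n", "Mike,29\n", "Roy,10\n"]
name, hours = person_with_most_hours(sample)
# Print once, after the whole input has been seen.
print('Person with largest hours is', name)
```

With a real file, the list can be replaced by the open file object, since iterating over a file yields its lines.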

python newbie - where is my if/else wrong?

Complete beginner so I'm sorry if this is obvious!
I have a file which is name | +/- or IG_name | 0 in a long list like so -
S1 +
IG_1 0
S2 -
IG_S3 0
S3 +
S4 -
dnaA +
IG_dnaA 0
Everything which starts with IG_ has a corresponding name. I want to add the + or - to the IG_name. e.g. IG_S3 is + like S3 is.
The information is gene names and strand information, IG = intergenic region. Basically I want to know which strand the intergenic region is on.
What I think I want:
open file
for every line, if the line starts with IG_*
    find the line with *
    print("IG_" and the line it found)
else
    print line
What I have:
import re
import sys
with open(sys.argv[2]) as geneInfo:
    with open(sys.argv[1]) as origin:
        for line in origin:
            if line.startswith("IG_"):
                name = line.split("_")[1]
                nname = name[:-3]
                for newline in geneInfo:
                    if re.match(nname, newline):
                        print("IG_"+newline)
            else:
                print(line)
where origin is the mixed list and geneInfo has only the names not IG_names.
With this code I end up with a list containing only the else statements.
S1 +
S2 -
S3 +
S4 -
dnaA +
My problem is that I don't know what is wrong, so I don't know what to search for in order to (attempt to) fix it!
Below is some step-by-step annotated code that hopefully does what you want (though instead of using print I have aggregated the results into a list so you can actually make use of it). I'm not quite sure what happened with your existing code (especially how you're processing two files?)
s_dict = {}
ig_list = []
with open('genes.txt', 'r') as infile:  # Simulating reading the file you pass in sys.argv
    for line in infile:
        if line.startswith('IG_'):
            ig_list.append(line.split()[0])  # Collect all our IG values for later
        else:
            s_name, value = line.split()  # Separate out the S value and its operator
            s_dict[s_name] = value.strip()  # Add to dictionary to map S to operator
# Now you can go back through your list of IG values and append the appropriate operator
pulled_together = []
for item in ig_list:
    s_value = item.split('_')[1]
    # The following will look for the operator mapped to the S value. If it is
    # not found, it will instead give you 'Not found'
    corresponding_operator = s_dict.get(s_value, 'Not found')
    pulled_together.append([item, corresponding_operator])
print('List structure')
print(pulled_together)
print('\n')
print('Printout of each item in list')
for item in pulled_together:
    print(item[0] + '\t' + item[1])
nname = name[:-3]
Python's slicing is very powerful, but it can be tricky to understand correctly.
When you write [:-3], you take everything except the last three items. The catch is that if the sequence has fewer than three elements, slicing does not raise an error; it returns an empty sequence (here an empty string, since name is a string).
I think this may be where things go wrong: if there are not many characters left on the line, the slice comes back empty. If you could show exactly what you want it to return there, with an example, it would help a lot, as I don't really know what you are trying to get with the slice.
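The slicing behaviour is easy to check by hand. The values below are illustrative and assume a line of the form "IG_S3 0\n", with a single space before the 0 and a trailing newline; with a different separator the slice would cut in a different place.

```python
name = "S3 0\n"      # what line.split("_")[1] gives for the line "IG_S3 0\n"
trimmed = name[:-3]  # drops the last three characters: " ", "0" and "\n"

short = "S3"
empty = short[:-3]   # fewer than three characters: an empty string, no error

# A variant that does not depend on counting characters: take the first
# whitespace-separated token and cut off the "IG_" prefix.
gene = "IG_dnaA 0\n".split()[0][len("IG_"):]

print(repr(trimmed), repr(empty), repr(gene))
```

Printing with repr() makes an empty string or a stray newline visible, which helps when a lookup silently fails to match.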
Does this do what you want?
from __future__ import print_function
import sys
# Read and store all the gene info lines, keyed by name
gene_info = dict()
with open(sys.argv[2]) as gene_info_file:
    for line in gene_info_file:
        tokens = line.split()
        name = tokens[0].strip()
        gene_info[name] = line
# Read the other file and lookup the names
with open(sys.argv[1]) as origin_file:
    for line in origin_file:
        if line.startswith("IG_"):
            name = line.split("_")[1]
            nname = name[:-3].strip()
            if nname in gene_info:
                lookup_line = gene_info[nname]
                print("IG_" + lookup_line)
            else:
                pass # what do you want to do in this case?
        else:
            print(line)
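If the gene lines and the IG_ lines live in the same list, as in the sample at the top of the question, a two-pass version keeps the original order and needs no second file. This is a sketch with assumptions: the function name annotate_intergenic is hypothetical, every line is taken to have exactly two whitespace-separated columns, and an IG_ entry without a matching gene keeps its original 0.

```python
def annotate_intergenic(lines):
    """Return 'name<TAB>strand' strings, giving each IG_ line its gene's strand."""
    # First pass: remember the strand of every ordinary gene.
    strands = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and not parts[0].startswith("IG_"):
            strands[parts[0]] = parts[1]

    # Second pass: rewrite IG_ lines, leaving everything else as it was.
    result = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, strand = parts
        if name.startswith("IG_"):
            strand = strands.get(name[len("IG_"):], strand)
        result.append(name + "\t" + strand)
    return result


sample = ["S1 +", "IG_1 0", "S2 -", "IG_S3 0", "S3 +", "S4 -", "dnaA +", "IG_dnaA 0"]
for row in annotate_intergenic(sample):
    print(row)
```

Building the dictionary first means the lookup works even when an IG_ line appears before its gene, as IG_S3 does in the sample.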

IndexError: list index out of range - Can't get my txt file to save the output in Python

I'm writing a program that refers to two text files. One stores the list of currencies in a sentence and one stores the current value. I want it to store each currency in the sentence only once, but also store the order so the sentence can be rebuilt on screen. E.g. Sterling, Euro (in one file). 1,1,2 (in the order file). I keep getting the error message
IndexError: list index out of range
all_currency = []
value_of_currency = []
my_words_file = open("words.txt","r")
my_words_document = my_words_file.readline()
while my_words_document != "":
    all_currency.append(my_words_document.strip())
    my_words_document = my_words_file.readline()
print("output of the words in the saved list:",str(all_currency))
my_words_file.close()
positions = ""
my_numbers_file = open("numbers.txt","r")
value_of_currency = my_numbers_file.readline().strip().split()
my_numbers_file.close()
for mynumbers in value_of_currency:
    positions += all_currency[int(mynumbers)]+" "
print("output of the positions in the saved list:",positions)
You are getting the error because of how you index the currency list. List indices start at 0, and in your case there are only 2 currencies, so the valid indices are 0 and 1; trying to get a third item, e.g. arr[2], raises this error.
def read_words(words_file):
    return [word for line in open(words_file, 'r') for word in line.strip().split(',')]
all_currency = read_words("a.txt")
print(all_currency)
value_of_currency = read_words("b.txt")
print(value_of_currency)
positions = ""
for mynumbers in value_of_currency:
    positions += all_currency[int(mynumbers)]+" "
print("output of the positions in the saved list:",positions)
# a.txt contains the currencies: "Sterling,Euro,Dollar"
# b.txt contains the positions: "1,1,2"
I didn't quite get what your desired output was, but I fixed the error. As I said earlier, you need to make sure that the values do not go past the last valid index. Hope that helps.
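If the numbers in the order file are meant as 1-based positions, which the example "Sterling, Euro" with "1,1,2" suggests but the question does not state, the lookup could subtract one and check the range before indexing. The function name words_from_positions is hypothetical and the sketch works on lists instead of reading the two files.

```python
def words_from_positions(words, positions):
    """Rebuild a sentence from 1-based positions into the list of unique words."""
    rebuilt = []
    for pos in positions:
        index = int(pos) - 1  # assumption: positions in the file start at 1
        if not 0 <= index < len(words):
            raise ValueError(
                "position %s is outside 1..%d" % (pos, len(words))
            )
        rebuilt.append(words[index])
    return " ".join(rebuilt)


all_currency = ["Sterling", "Euro"]
value_of_currency = "1,1,2".split(",")
print(words_from_positions(all_currency, value_of_currency))
```

The explicit range check also rejects 0, which would otherwise become index -1 and silently return the last word.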
