I'm writing a script that parses log files and matches certain strings like "INFO", "WARN", "SEVERE", etc.
I can do this without much trouble using the code below.
from sys import argv
from collections import OrderedDict

# Find and catalog each log line that matches these strings
match_strings = ["INFO", "WARN", "SEVERE"]

if len(argv) > 1:
    files = argv[1:]
else:
    print "ERROR: You must provide at least one log file to be processed."
    print "Example:"
    print "%s my.log" % argv[0]
    exit(2)

for filename in files:
    with open(filename) as f:
        data = f.read().splitlines()

    # Create data structure to handle results
    matches = OrderedDict()
    for string in match_strings:
        matches[string] = []

    for i, s in enumerate(data, 1):
        for string in match_strings:
            if string in s:
                matches[string].append('Line %03d: %s' % (i, s,))

    for string in matches:
        print "\"%s\": %d" % (string, len(matches[string]))
Log files look like:
2014-05-26T15:06:14.597+0000 INFO...
2014-05-26T15:06:14.597+0000 WARN...
2014-05-27T15:06:14.597+0000 INFO...
2014-05-28T15:06:14.597+0000 SEVERE...
2014-05-29T15:06:14.597+0000 SEVERE...
Current output looks like:
"INFO": 2
"WARN": 1
"SEVERE": 2
However, what I'd rather do is have the script collate and print formatted output by date. So, rather than print a simple list (above) we could get something like the following using the sample from above:
Category   2014-05-26  2014-05-27  2014-05-28  2014-05-29
"INFO":             1           1           0           0
"WARN":             1           0           0           0
"SEVERE":           0           0           1           1
Are there any thoughts / suggestions how to accomplish this?
One way of doing this would be to make a class with counters info, warn, and severe, and a dictionary whose keys are dates and whose values are instances of this class. Then, as you parse your log file, find the date on each line, use it as the key into your dictionary, and increment info, warn, or severe as needed.
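A minimal sketch of that idea, assuming each log line starts with an ISO timestamp (so the date is the text before the 'T') and using 'my.log' as a placeholder filename:

from collections import OrderedDict

class Counts(object):
    """Per-date tallies for each log category."""
    def __init__(self):
        self.info = self.warn = self.severe = 0

by_date = OrderedDict()  # date -> Counts, in first-seen order

with open('my.log') as f:
    for line in f:
        date = line.split('T', 1)[0]  # "2014-05-26" from "2014-05-26T15:06:14..."
        counts = by_date.setdefault(date, Counts())
        if 'INFO' in line:
            counts.info += 1
        elif 'WARN' in line:
            counts.warn += 1
        elif 'SEVERE' in line:
            counts.severe += 1

# One column per date, one row per category
dates = list(by_date)
print '%-10s' % 'Category', ' '.join('%-10s' % d for d in dates)
for label, attr in [('"INFO":', 'info'), ('"WARN":', 'warn'), ('"SEVERE":', 'severe')]:
    print '%-10s' % label, ' '.join('%-10d' % getattr(by_date[d], attr) for d in dates)

The OrderedDict keeps the dates in the order they are first seen, which gives you the column ordering for free.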
Related
I'm trying to run a Python script to draw sequences from a separate file (merged.fas), with respect to a list (gene_fams_eggnog.txt) I have as output from another program.
The code is as follows:
from Bio import SeqIO
import os, sys, re
from collections import defaultdict

sequences = "merged.fas"
all_seqs = SeqIO.index(sequences, "fasta")
gene_fams = defaultdict(list)

gene_fams_file = open("gene_fams_eggnog.txt")
for line in gene_fams_file:
    fields = re.split("\t", line.rstrip())
    gene_fams[fields[0]].append[fields[1]]

for fam in gene_fams.keys():
    output_filename = str(fam) + ".fasta"
    outh = open(output_filename, "w")
    for id in gene_fams[fam]:
        if id in all_seqs:
            outh.write(">" + all_seqs[id].description + "\n" + str(all_seqs[id].seq) + "\n")
        else:
            print "Uh oh! Sequence with ID " + str(id) + " is not in the all_seqs file!"
            quit()
    outh.close()
The list looks like this:
1 Saccharomycescerevisiae_DAA09367.1
1 bieneu_EED42827.1
1 Asp_XP_749186.1
1 Mag_XP_003717339.1
2 Mag_XP_003716586.1
2 Mag_XP_003709453.1
3 Asp_XP_749329.1
Field 0 denotes a grouping based on similarity between the sequences. The script was meant to take all the sequences from merged.fas that correspond to the code in field 1 and write them into a file named after field 0.
So in the case of the portion of the list I have shown, all the sequences that have a 1 in field 0 (Saccharomycescerevisiae_DAA09367.1, bieneu_EED42827.1, Asp_XP_749186.1, Mag_XP_003717339.1) would have been written into a file called 1.fasta. This should continue from 2.fasta through however many groups there are.
This has worked, except that it doesn't include all the sequences in each group; it only includes the last one listed for that group. Using my example above, I'd only have a file (1.fasta) with one sequence (Mag_XP_003717339.1) instead of all four.
Any and all help is appreciated,
Thanks,
JT
Although I didn't spot the cause of the issue you complained about, I'm surprised your code runs at all with this error:
gene_fams[fields[0]].append[fields[1]]
i.e. append[...] instead of append(...). But perhaps that's also, "not there in the actual script I'm running". I rewrote your script below, and it works fine for me. One issue was your use of the variable name id, which is a Python builtin. You'll see I go to an extreme to avoid such errors:
from Bio import SeqIO
from collections import defaultdict

SEQUENCE_FILE_NAME = "merged.fas"
FAMILY_FILE_NAME = "gene_families_eggnog.txt"

all_sequences = SeqIO.index(SEQUENCE_FILE_NAME, "fasta")
gene_families = defaultdict(list)

with open(FAMILY_FILE_NAME) as gene_families_file:
    for line in gene_families_file:
        family_id, gene_id = line.rstrip().split()
        gene_families[family_id].append(gene_id)

for family_id, gene_ids in gene_families.items():
    output_filename = family_id + ".fasta"
    with open(output_filename, "w") as output:
        for gene_id in gene_ids:
            assert gene_id in all_sequences, "Sequence {} is not in {}!".format(gene_id, SEQUENCE_FILE_NAME)
            output.write(all_sequences[gene_id].format("fasta"))
I need to create a program that takes a text file called a FASTA file and transforms it to give the sequence name, the domain names, and the start and end of each domain.
So a FASTA file is just a text file that looks like this:
>MICE_8
ATTCGATCGATCGATTTCGATCGATCGATCGATCGGGATCGATCGATCGATCGATC
>MICE_59
ATTTTTCGGCATCGATAGCTAGCTAGCTAG
My program needs to take one command-line argument, which is the filename of the FASTA file, and give output like this:
MICE_8 gnl|CDD|256537 819 923 gnl|CDD|260076 111 189 gnl|CDD|260056 4 93
MICE_59
Here is a description of the output for more information:
MICE_8 is the name of the first sequence in the fasta file
gnl|CDD|256537 is the name of the first protein domain
819 is where the domain starts
923 is where it ends
gnl|CDD|260076 is the name of the second protein domain for the first sequence, and so on; it starts at 111 and ends at position 189.
Also, since the last sequence did not get a hit, the program still needs to display its name.
OK, so here is my code so far and what it outputs:
import sys
import os

fastaname = sys.argv[1]
rpsblastname = "rpsblast.out"
cmd = "rpsblast+ -db /home/bryan/data/cdd/cdd -query %s -outfmt 6 -evalue 0.05 > %s" % (fastaname, rpsblastname)
os.system(cmd)

handle = open(rpsblastname, "r")
seqname = ""
for line in handle:
    linearr = line.split()
    # seqname = linearr[0]
    domain = linearr[1]
    start = linearr[6]
    end = linearr[7]
    # If sequence name is the same as last time, don't print it
    if seqname == linearr[0]:
        sys.stdout.write("%s %s %s" % (domain, start, end))
    # Otherwise do print the sequence name, and update seqname
    else:
        seqname = linearr[0]
        print
        sys.stdout.write("%s %s %s %s" % (seqname, domain, start, end))
here is what my output looks like so far:
mel#roswald:~$ ./Domainfinder.py bioinformation.fasta
MICE_8 gnl|CDD|256537 819 923gnl|CDD|260076 111 189gnl|CDD|260056 4 93
The program I created is almost to the required specification. I only have 3 problems that I need to address:
1. there is an extra blank line between where I run the program and the result
2. my program does not write out the name of the sequence which has zero hits
3. my program does not separate the domain entries by a space.
The correct output should look like this:
mel#roswald:~$ ./Domainfinder.py bioinformation.fasta
MICE_8 gnl|CDD|256537 819 923 gnl|CDD|260076 111 189 gnl|CDD|260056 4 93
MICE_59
Solved the issue. Mainly what needed to be done was to use a dictionary holding the sequence names as keys instead of using lists. Then, since dictionaries are unordered, we need to create a list from the dictionary to keep the sequence names in the order they are read. Also, we extract the sequence names from the rpsblast output. If anyone has any questions, feel free to PM me.
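A minimal sketch of that approach, assuming the same rpsblast.out column layout as in the script above and that every query name in the BLAST output also appears in the FASTA file:

import sys

fastaname = sys.argv[1]
rpsblastname = "rpsblast.out"

# Record the sequence names in the order they appear in the FASTA file,
# so that sequences with zero hits are still printed.
seq_order = []
hits = {}  # sequence name -> list of "domain start end" strings
for line in open(fastaname):
    if line.startswith(">"):
        name = line[1:].strip()
        seq_order.append(name)
        hits[name] = []

for line in open(rpsblastname):
    fields = line.split()
    seqname, domain, start, end = fields[0], fields[1], fields[6], fields[7]
    hits[seqname].append("%s %s %s" % (domain, start, end))

for name in seq_order:
    print " ".join([name] + hits[name])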
I have two files A and B in FASTQ format, which are basically several hundred million lines of text organized in groups of 4 lines starting with an @ as follows:
@120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
GCCAATGGCATGGTTTCATGGATGTTAGCAGAAGACATGAGACTTCTGGGACAGGAGCAAAACACTTCATGATGGCAAAAGATCGGAAGAGCACACGTCTGAACTCN
+120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
bbbeee_[_ccdccegeeghhiiehghifhfhhhiiihhfhghigbeffeefddd]aegggdffhfhhihbghhdfffgdb^beeabcccabbcb`ccacacbbccB
I need to compare the
5:1101:1156:2031#0/
part between files A and B and write the groups of 4 lines in file B that matched to a new file. I got a piece of Python code that does that, but it only works for small files, as it parses through all the @-lines of file B for every @-line in file A, and both files contain hundreds of millions of lines.
Someone suggested that I should create an index for file B; I have googled around without success and would be very grateful if someone could point out how to do this or let me know of a tutorial so I can learn. Thanks.
==EDIT==
In theory each group of 4 lines should only exist once in each file. Would it increase the speed enough to break out of the parsing after each match, or do I need a different algorithm altogether?
An index is just a shortened version of the information you are working with. In this case, you will want the "key" - the text between the first colon (':') on the @-line and the final slash ('/') near the end - as well as some kind of value.
Since the "value" in this case is the entire contents of the 4-line block, and since our index is going to store a separate entry for each block, we would be storing the entire file in memory if we used the actual value in the index.
Instead, let's use the file position of the beginning of the 4-line block. That way, you can move to that file position, print 4 lines, and stop. Total cost is the 4 or 8 or however many bytes it takes to store an integer file position, instead of however-many bytes of actual genome data.
Here is some code that does the job, but also does a lot of validation and checking. You might want to throw stuff away that you don't use.
import sys

def build_index(path):
    index = {}
    for key, pos, data in parse_fastq(path):
        if key not in index:
            # Don't overwrite duplicates - use first occurrence.
            index[key] = pos
    return index

def error(s):
    sys.stderr.write(s + "\n")

def extract_key(s):
    # This much is fairly constant:
    assert(s.startswith('@'))
    (machine_name, rest) = s.split(':', 1)
    # Per wikipedia, this changes in different variants of FASTQ format:
    (key, rest) = rest.split('/', 1)
    return key

def parse_fastq(path):
    """
    Parse the 4-line FASTQ groups in path.
    Validate the contents, somewhat.
    """
    f = open(path)
    i = 0
    # Note: iterating a file is incompatible with fh.tell(). Fake it.
    pos = offset = 0
    for line in f:
        offset += len(line)
        lx = i % 4
        i += 1
        if lx == 0:  # @machine: key
            key = extract_key(line)
            len1 = len2 = 0
            data = [line]
        elif lx == 1:
            data.append(line)
            len1 = len(line)
        elif lx == 2:  # +machine: key or something
            assert(line.startswith('+'))
            data.append(line)
        else:  # lx == 3 : quality data
            data.append(line)
            len2 = len(line)
            if len2 != len1:
                error("Data length mismatch at line "
                      + str(i - 2)
                      + " (len: " + str(len1) + ") and line "
                      + str(i)
                      + " (len: " + str(len2) + ")\n")
            #print "Yielding #%i: %s" % (pos, key)
            yield key, pos, data
            pos = offset
    if i % 4 != 0:
        error("EOF encountered in mid-record at line " + str(i))

def match_records(path, index):
    results = []
    for key, pos, d in parse_fastq(path):
        if key in index:
            # found a match!
            results.append(key)
    return results

def write_matches(inpath, matches, outpath):
    rf = open(inpath)
    wf = open(outpath, 'w')
    for m in matches:
        rf.seek(m)
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
    rf.close()
    wf.close()

#import pdb; pdb.set_trace()
index = build_index('afile.fastq')
matches = match_records('bfile.fastq', index)
posns = [index[k] for k in matches]
write_matches('afile.fastq', posns, 'outfile.fastq')
Note that this code goes back to the first file to get the blocks of data. If your data is identical between files, you would be able to copy the block from the second file when a match occurs.
Note also that depending on what you are trying to extract, you may want to change the order of the output blocks, and you may want to make sure that the keys are unique, or perhaps make sure the keys are not unique but are repeated in the order they match. That's up to you - I'm not sure what you're doing with the data.
These guys claim to parse files of a few gigabytes using a dedicated library; see http://www.biostars.org/p/15113/
from Bio import SeqIO

fastq_parser = SeqIO.parse(fastq_filename, "fastq")
wanted = (rec for rec in fastq_parser if ...)
SeqIO.write(wanted, output_file, "fastq")
A better approach IMO would be to parse it once and load the data into some database (e.g. MySQL) instead of that output_file, and later run the queries there.
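As a minimal sketch of that idea, using the standard-library sqlite3 in place of MySQL, and assuming the key is everything between the first colon and the final slash of the header, as in the question (the filenames are placeholders):

import sqlite3
from Bio import SeqIO

def key_of(rec):
    # Everything between the first ':' and the final '/' of the header
    return rec.description.split(':', 1)[1].rsplit('/', 1)[0]

# One-time load: store each record of file B by its key.
conn = sqlite3.connect('reads.db')
conn.execute('CREATE TABLE IF NOT EXISTS reads (key TEXT PRIMARY KEY, record TEXT)')
for rec in SeqIO.parse('bfile.fastq', 'fastq'):
    conn.execute('INSERT OR IGNORE INTO reads VALUES (?, ?)',
                 (key_of(rec), rec.format('fastq')))
conn.commit()

# Later: look up matches from file A without rescanning file B.
with open('outfile.fastq', 'w') as out:
    for rec in SeqIO.parse('afile.fastq', 'fastq'):
        row = conn.execute('SELECT record FROM reads WHERE key = ?',
                           (key_of(rec),)).fetchone()
        if row:
            out.write(row[0])

The load step is paid once; after that, each lookup is an indexed query rather than a scan of file B.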
I am using this code to find the difference between two CSV files and have some formatting questions. This is probably an easy fix, but I am new and trying to learn and having a lot of problems.
import difflib

diff = difflib.ndiff(open('test1.csv', "rb").readlines(),
                     open('test2.csv', "rb").readlines())
try:
    while 1:
        print diff.next(),
except:
    pass
The code works fine and I get the output I am looking for:
  Group,Symbol,Total
- Adam,apple,3850
?            ^
+ Adam,apple,2850
?            ^
  bob,orange,-45
  bob,lemon,66
  bob,appl,-56
  bob,,88
My question is how do I clean up the formatting: can I make Group,Symbol,Total into separate columns and line up the text below?
Also, can I change the ? to text I determine, such as test1 and test2, to indicate which sheet it comes from?
thanks for any help
Using difflib.unified_diff gives much cleaner output; see below.
Also, both difflib.ndiff and difflib.unified_diff return a generator, which you can use directly in a for loop and which knows when to quit, so you don't have to handle exceptions yourself. N.B.: the comma after line is to prevent print from adding another newline.
import difflib

s1 = ['Adam,apple,3850\n', 'bob,orange,-45\n', 'bob,lemon,66\n',
      'bob,appl,-56\n', 'bob,,88\n']
s2 = ['Adam,apple,2850\n', 'bob,orange,-45\n', 'bob,lemon,66\n',
      'bob,appl,-56\n', 'bob,,88\n']

for line in difflib.unified_diff(s1, s2, fromfile='test1.csv',
                                 tofile='test2.csv'):
    print line,
This gives:
--- test1.csv
+++ test2.csv
@@ -1,4 +1,4 @@
-Adam,apple,3850
+Adam,apple,2850
 bob,orange,-45
 bob,lemon,66
 bob,appl,-56
So you can clearly see which lines were changed between test1.csv and test2.csv.
To line up the columns, you must use string formatting.
E.g. print "%-20s %-20s %-20s" % (row[0], row[1], row[2]).
To change the ? into any text you like, you'd use s.replace('?', 'any text I like').
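For example, here is a small sketch (with hypothetical test1/test2 labels) that replaces ndiff's -/+ markers with the name of the file each line came from and drops the ? guide lines:

import difflib

for line in difflib.ndiff(open('test1.csv').readlines(),
                          open('test2.csv').readlines()):
    # Label each side with the file it came from instead of the -/+ markers
    if line.startswith('- '):
        print 'test1: ' + line[2:],
    elif line.startswith('+ '):
        print 'test2: ' + line[2:],
    elif not line.startswith('? '):  # common lines; drop the guide lines
        print '       ' + line[2:],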
Your problem has more to do with the CSV format, since difflib has no idea it's looking at columnar fields. What you need is to figure out into which field the guide is pointing, so that you can adjust it when printing the columns.
If your CSV files are simple, i.e. they don't contain any quoted fields with embedded commas or (shudder) newlines, you can just use split(',') to separate them into fields, and figure out where the guide points as follows:
def align(line, guideline):
    """
    Figure out which field the guide (^) points to, and the offset within it.
    E.g., if the guide points 3 chars into field 2, return (2, 3)
    """
    fields = line.split(',')
    guide = guideline.index('^')
    f = p = 0
    while p + len(fields[f]) < guide:
        p += len(fields[f]) + 1  # +1 for the comma
        f += 1
    offset = guide - p
    return f, offset
Now it's easy to show the guide properly. Let's say you want to align your columns by printing everything 12 spaces wide:
diff = difflib.ndiff(...)
for line in diff:
    code = line[0]  # The diff prefix
    print code,
    if code == '?':
        fld, offset = align(lastline, line[2:])
        for f in range(fld):
            print "%-12s" % '',
        print ' ' * offset + '^'
    else:
        fields = line[2:].rstrip('\r\n').split(',')
        for f in fields:
            print "%-12s" % f,
        print
    lastline = line[2:]
Be warned that the only reliable way to parse CSV files is to use the csv module (or a robust alternative); but getting it to play well with the diff format (in full generality) would be a bit of a headache. If you're mainly interested in readability and your CSV isn't too gnarly, you can probably live with an occasional mix-up.
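For instance, a minimal sketch with the csv module, assuming three columns as in your sample:

import csv

with open('test1.csv', 'rb') as f:
    for row in csv.reader(f):
        # row is a list of fields, with quoting and embedded commas handled
        print "%-12s %-12s %-12s" % tuple(row)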
This seems fairly trivial, but I can't seem to work it out.
I have a text file with the contents:
B>F
I am reading this with the code below, stripping the '>' and trying to convert the strings into their corresponding ASCII values, minus 65, to give me values that correspond to indexes in another list:
def readRoute():
    routeFile = open('route.txt', 'r')
    for line in routeFile.readlines():
        route = line.strip('\n' '\r')
        route = line.split('>')
        #startNode, endNode = route
        startNode = ord(route[0])-65
        endNode = ord(route[1])-65
        # Debug (this comment was for my use to explain below the print values)
        print 'Route Entered:'
        print line
        print startNode, ',', endNode, '\n'
    return [startNode, endNode]
However, I am having slight trouble doing the conversion nicely, because the text file only contains one line at the moment, but ideally I need it to support more than one line and run some code for each line.
For example it could contain:
B>F
A>D
C>F
E>D
So I would want to run the same code outside this function 4 times with the different inputs
Anyone able to give me a hand?
Edit:
Not sure I made my issue that clear, sorry
What I need it to do is parse the text file (possibly containing one line or multiple lines like above). I am able to do it for one line with the lines:
startNode = ord(route[0])-65
endNode = ord(route[1])-65
But I get errors when trying to do more than one line, because ord() ends up getting the wrong input.
If I have (below) in the route.txt
B>F
A>D
This is the error it gives me:
line 43, in readRoute
    endNode = ord(route[1])-65
TypeError: ord() expected a character, but string of length 2 found
My code above should read the route.txt file and see that B>F is the first route, strip the '>', convert the B & F to ASCII (66 & 70 respectively), then subtract 65 from both to give 1 & 5 (in this example).
The 1 & 5 are corresponding indexes for another "array" (list of lists) to do computations and other things on
Once the other code has completed it can then go to the next line in route.txt which could be A>D and perform the above again
Perhaps this will work for you. I turned the file read into a generator, so you can do as you please with the parsed results in the for loop.
def readRoute(file_name):
    with open(file_name, 'r') as r:
        for line in r:
            yield (ord(line[0])-65, ord(line[2])-65)

filename = 'route.txt'
for startnode, endnode in readRoute(filename):
    print startnode, endnode
If you can't change readRoute, change the contents of the file before each call. Better yet, make readRoute take the filename as a parameter (default it to 'route.txt' to preserve the current behavior) so you can have it process other files.
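A minimal sketch of that suggestion, returning all parsed routes at once so the caller can loop over them:

def readRoute(filename='route.txt'):
    routes = []
    with open(filename) as routeFile:
        for line in routeFile:
            start, end = line.strip().split('>')
            routes.append((ord(start) - 65, ord(end) - 65))
    return routes

# Run the same follow-up code once per route in the file
for startNode, endNode in readRoute():
    print startNode, ',', endNode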
What about something like this? It takes the routes defined in your file and turns them into path objects with start and end member variables. As an added bonus PathManager.readFile() allows you to load multiple route files without overwriting the existing paths.
import re

class Path:
    def __init__(self, start, end):
        self.start = ord(start) - 65  # Scale the values as desired
        self.end = ord(end) - 65      # Scale the values as desired

class PathManager:
    def __init__(self):
        self.expr = re.compile("^([A-Za-z])[>]([A-Za-z])$")  # looks for string "C>C" where C is a char
        self.paths = []

    def do_logic_routine(self, start, end):
        # Do custom logic here that will execute before the next line is read
        # Return True for 'continue reading' or False to stop parsing file
        return True

    def readFile(self, path):
        file = open(path, "r")
        for line in file:
            item = self.expr.match(line.strip())  # strip whitespace before parsing
            if item:
                '''
                item.group(0) is *not* used here; it matches the whole expression
                item.group(1) matches the first parenthesis in the regular expression
                item.group(2) matches the second
                '''
                self.paths.append(Path(item.group(1), item.group(2)))
                if not self.do_logic_routine(self.paths[-1].start, self.paths[-1].end):
                    break

# Running the example
MyManager = PathManager()
MyManager.readFile('route.txt')
for path in MyManager.paths:
    print "Start: %s End: %s" % (path.start, path.end)
Output is:
Start: 1 End: 5
Start: 0 End: 3
Start: 2 End: 5
Start: 4 End: 3