I have a matrix written in this format inside a log file:
2014-09-08 14:10:20,107 - root - INFO - [[ 8.30857546 0.69993454 0.20645551
77.01797674 13.76705776]
[ 8.35205432 0.53417203 0.19969048 76.78598173 14.12810144]
[ 8.37066492 0.64428449 0.18623849 76.4181809 14.3806312 ]
[ 8.50493296 0.5110043 0.19731849 76.45838604 14.32835821]
[ 8.18900791 0.4955451 0.22524777 76.96966663 14.12053259]]
...some text
2014-09-08 14:12:22,211 - root - INFO - [[ 3.25142253e+01 1.11788106e+00 1.51065008e-02 6.16496299e+01
4.70315726e+00]
[ 3.31685887e+01 9.53522041e-01 1.49767860e-02 6.13449154e+01
4.51799710e+00]
[ 3.31101827e+01 1.09729703e+00 5.03347259e-03 6.11818594e+01
4.60562742e+00]
[ 3.32506957e+01 1.13837592e+00 1.51783456e-02 6.08651657e+01
4.73058437e+00]
[ 3.26809490e+01 1.06617279e+00 1.00110121e-02 6.17429172e+01
4.49994994e+00]]
I am writing this matrix using the python logging package:
logging.info(conf_mat)
However, logging.info does not give me a way to format the matrix values as %.3f floats. So I decided to parse the log file this way:
conf_mat = [[]]
cf = '[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?'
with open(sys.argv[1]) as f:
    for line in f:
        epoch = re.findall(ep, line)  # find lines starting with epoch for other stuff
        if epoch:
            error_line = next(f)  # grab the next line, which is the error line
            error_value = error_line[error_line.rfind('=')+1:]
            data_points.append(map(float, epoch[0]+(error_value,)))  # get the error value for the specific epoch
        for i in range(N):
            cnf_mline = next(f)
            match = re.findall(cf, cnf_mline)
            if match:
                conf_mat[count].append(map(float, match))
            else:
                conf_mat.append([])
                count += 1
However, the regex does not catch the numbers that wrap onto the following line inside a matrix, so I get ragged rows when I try to convert the matrix using
conf_mtx = np.array(conf_mat)
Your regex string cf needs to be a raw string literal:
cf = r'[+-]?(?=\d*[.eE])(?=\.?\d)\d*\.?\d*(?:[eE][+-]?\d+)?'
in order to work properly. Backslash \ characters are interpreted as escape sequences in "regular" strings, but a regex needs them passed through verbatim. You can read about raw string literals at the top of the re module's documentation, and in this excellent SO answer; Alex Martelli explains them quite well, so I won't repeat everything he says here. Suffice it to say that if you don't use a raw literal, you have to escape each and every one of your backslashes with another backslash, and that gets ugly and annoying fast.
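Here's a quick demonstration of how a non-raw literal can silently corrupt a pattern (a minimal sketch; the \b example is mine, not taken from your pattern):

import re

# '\b' is a word boundary in a regex, but a backspace escape in a plain string
print(len('\babc'))   # 4 -- backslash-b collapsed into a single backspace char
print(len(r'\babc'))  # 5 -- the raw literal keeps the backslash and the 'b'

# the raw pattern matches 'cat' as a whole word; the non-raw one matches nothing
print(re.findall(r'\bcat\b', 'cat catalog'))  # ['cat']
print(re.findall('\bcat\b', 'cat catalog'))   # []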
As for the rest of your code, it won't run without more information. The N in for i in range(N): is undefined, as is count a few lines later. Calling cnf_mline = next(f) is also fragile: you advance the file iterator by hand inside the same loop that for line in f: is consuming, so it's easy to run out of lines before you've iterated over all of them. It's unclear whether your data really has that line break in the second half, where one of the members of a row sits on the next line; I assume it does, given your parsing attempt.
I think you should first clean up your input file into a regular format; then you'll have a much easier time running regular expressions on it. Just strip each line, join the pieces of each matrix back together, and write them to a new file or list. If you need to work on subsequent lines without running out your generator with excessive uses of next(), check out itertools.tee(). It returns n independent iterators from a single iterable, allowing you to advance the second one a line ahead of the first. Alternatively, you could read your file's lines into a list and just operate on indices i and i+1. You can then rewrite your matching loop to simply pull each number of the appropriate format out and insert it into your matrix at the correct position. The good news is your regex caught everything I threw at it, so you won't need to modify anything there.
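For illustration, a minimal sketch of the tee() lookahead idea (the helper name and file name are mine):

from itertools import tee

def with_lookahead(lines):
    # two independent copies of the iterator, one advanced a single step
    cur, ahead = tee(lines)
    next(ahead, None)
    return zip(cur, ahead)

with open('run.log') as f:  # hypothetical log file
    for line, next_line in with_lookahead(f):
        pass  # inspect line together with the line that follows it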
Good luck!
Python: 2.7.9
I erased all of my code because I'm going nuts.
Here's the gist (it's for a Rosalind challenge thingy):
I want to take a file that looks like this (no quotes on carets)
">"Rosalind_0304
actgatcgtcgctgtactcg
actcgactacgtagctacgtacgctgcatagt
">"Rosalind_2480
gctatcggtactgcgctgctacgtg
ccccccgaagaatagatag
">"Rosalind_2452
cgtacgatctagc
aaattcgcctcgaactcg
etc...
What I can't figure out how to do is basically everything at this point; my mind is so muddled. I'll just show kind of what I was doing, but failing to do.
1st: search the file for '>'.
2nd: assign the rest of that line into the dictionary as a key.
3rd: read the next lines up until the next '>', do some calculations, and store the findings as the value for that key.
4th: go through the file and do that for every string.
5th: compare all the values and return the key of whichever one is highest.
Can anyone help?
It might help if I just take a break. I've been coding all day and I think I smell colors.
def func(dna_str):
    bla
    return gcp  # gc count percentage returned to the value in dict
With my_function somewhere that returns that percentage value:
with open('rosalind.txt', 'r') as ros:
    rosa = {line[1:].split(' ')[0]: my_function(line.split(' ')[1].strip()) for line in ros if line.strip()}

top_key = max(rosa, key=rosa.get)
print(top_key, rosa.get(top_key))
For each line in the file, the comprehension first checks whether there's anything left after stripping trailing whitespace, discarding blank lines. It then adds each non-blank line as an entry to a dictionary: the key is everything to the left of the space (minus the unneeded >), and the value is the result of passing everything to the right of the space to your function.
Then it saves the key corresponding to the highest value, and prints that key along with its corresponding value. You're left with a dictionary rosa that you can process however you like.
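In case the max(..., key=...) idiom is unfamiliar, here is a tiny self-contained illustration (the sample values are made up):

rosa = {'Rosalind_0001': 42.0, 'Rosalind_0002': 57.5}
top_key = max(rosa, key=rosa.get)   # compares entries by value, returns the key
print(top_key, rosa.get(top_key))   # Rosalind_0002 57.5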
Complete code of the module:
def my_function(dna):
    return 100 * len(dna.replace('A', '').replace('T', '')) / len(dna)

with open('rosalind.txt', 'r') as ros:
    with open('rosalind_clean.txt', 'w') as output:
        for line in ros:
            if line.startswith('>'):
                output.write('\n' + line.strip())
            elif line.strip():
                output.write(line.strip())

with open('rosalind_clean.txt', 'r') as ros:
    rosa = {line[1:].split(' ')[0]: my_function(line.split(' ')[1].strip()) for line in ros if line.strip()}

top_key = max(rosa, key=rosa.get)
print(top_key, rosa.get(top_key))
Complete content of rosalind.txt:
>Rosalind_6404 CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCG
TTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG
>Rosalind_5959 CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCA
GGCGCTCCGCCGAAGGTCTATATCCA
TTTGTCAGCAGACACGC
>Rosalind_0808 CCACCCTCGTGGT
ATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGACTGGGAACCTGCGGGCAGTAGGTGGAAT
Result when running the module:
Rosalind_0808 60.91954022988506
This should properly handle an input file that doesn't necessarily have one entry per line.
See SO's formatting guide to learn how to make inline or block code tags to get past things like ">". If you want it to appear as regular text rather than code, escape the > with a backslash:
Type:
\>Rosalind
Result:
>Rosalind
I think I got that part down now. Thanks so much. BUUUUT. It's throwing an error:
rosa = {line[1:].split(' ')[0]:calc(line.split(' ')[1].strip()) for line in ros if line.strip()}
IndexError: list index out of range
This is my func, btw.
def calc(dna_str):
    gc = 0    # G/C count
    divc = 0  # total base count
    for x in dna_str:
        if x == 'G':
            gc += 1
            divc += 1
        elif x == 'C':
            gc += 1
            divc += 1
        else:
            divc += 1
    gcp = float(gc) / divc  # convert before dividing, or Python 2 truncates
    return gcp
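For what it's worth, once the counters are initialized, the same calculation can be written more compactly with str.count (a sketch, assuming the sequence is uppercase like your test file):

def calc(dna_str):
    # count G and C directly instead of looping character by character
    gc = dna_str.count('G') + dna_str.count('C')
    return float(gc) / len(dna_str)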
Exact test file, with no blank lines before or after:
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
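One observation on that traceback (my reading of the cleanup code earlier in the thread, so treat it as an educated guess): with headers on their own line, output.write('\n'+line.strip()) joins each header and its first sequence line with no space between them, so line.split(' ') yields a one-element list and the [1] index raises IndexError. Writing a space after each header should fix it:

if line.startswith('>'):
    # trailing space separates the header from the sequence in the cleaned file
    output.write('\n' + line.strip() + ' ')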
I have checked and played with various examples, and it appears that my problem is a bit more complex than anything I have been able to find. What I need to do is search for a particular string, then delete the following line and keep deleting lines until another string is found. So an example would be the following:
a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z
Here, "color [" is match1, and "] #color" is match2. So then what is desired is the following:
a
b
color [
] #color
y
z
This "simple to follow" code example will get you started .. you can tweak it as needed. Note that it processes the file line-by-line, so this will work with any size file.
start_marker = 'startdel'
end_marker = 'enddel'

with open('data.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print line,
            ignoreLines = True
        if end_marker in line:
            ignoreLines = False
        if not ignoreLines:
            print line,
It uses startdel and enddel as "markers" for starting and ending the ignoring of data.
Update:
Modified code based on a request in the comments, this will now include/print the lines that contain the "markers".
Given this input data (borrowed from @drewk):
Beginning of the file...
stuff
startdel
delete this line
delete this line also
enddel
stuff as well
the rest of the file...
it yields:
Beginning of the file...
stuff
startdel
enddel
stuff as well
the rest of the file...
You can do this with a single regex by using a non-greedy *. E.g., assuming you want to keep both the "look for this line" and the "until this line is found" lines, and discard only the lines in between, you could do:
>>> my_regex = re.compile("(look for this line)"+
... ".*?"+ # match as few chars as possible
... "(until this line is found)",
... re.DOTALL)
>>> new_str = my_regex.sub(r"\1\2", old_str)
A few notes:
The re.DOTALL flag tells Python that "." can match newlines -- by default it matches any character except a newline
The parentheses define "numbered match groups", which are then used in the replacement r"\1\2" to make sure that we don't discard the first and last line. (Note the raw literal again: in a plain string, "\1" is an octal escape for the character with code 1, not a group reference.) If you did want to discard either or both of those lines, just get rid of the \1 and/or the \2. E.g., to keep the first but not the last, use my_regex.sub(r"\1", old_str); to get rid of both, use my_regex.sub("", old_str).
For further explanation, see: http://docs.python.org/library/re.html or search for "non-greedy regular expression" in your favorite search engine.
This works:
s="""Beginning of the file...
stuff
look for this line
delete this line
delete this line also
until this line is found
stuff as well
the rest of the file... """
import re
print re.sub(r'(^look for this line$).*?(^until this line is found$)',
r'\1\n\2',s,count=1,flags=re.DOTALL | re.MULTILINE)
prints:
Beginning of the file...
stuff
look for this line
until this line is found
stuff as well
the rest of the file...
You can also use list slices to do this:
mStart = 'look for this line'
mStop = 'until this line is found'
li = s.split('\n')
print '\n'.join(li[0:li.index(mStart)+1] + li[li.index(mStop):])
Same output.
I like re for this (being a Perl guy at heart...)
I am using this code to find the difference between two CSV files, and I have some formatting questions. This is probably an easy fix, but I am new, trying to learn, and having a lot of problems.
import difflib

diff = difflib.ndiff(open('test1.csv', "rb").readlines(),
                     open('test2.csv', "rb").readlines())
try:
    while 1:
        print diff.next(),
except:
    pass
the code works fine and I get the output I am looking for as:
Group,Symbol,Total
- Adam,apple,3850
? ^
+ Adam,apple,2850
? ^
bob,orange,-45
bob,lemon,66
bob,appl,-56
bob,,88
My question is: how do I clean up the formatting? Can I make Group,Symbol,Total into separate columns, and line up the text below them?
Also, can I change the ? to a text of my choosing, such as test1 and test2, to show which sheet each line comes from?
Thanks for any help.
Using difflib.unified_diff gives much cleaner output, see below.
Also, both difflib.ndiff and difflib.unified_diff return generators, which you can use directly in a for loop and which know when to stop, so you don't have to handle exceptions yourself. N.B.: the trailing comma after print line, keeps print from adding another newline.
import difflib

s1 = ['Adam,apple,3850\n', 'bob,orange,-45\n', 'bob,lemon,66\n',
      'bob,appl,-56\n', 'bob,,88\n']
s2 = ['Adam,apple,2850\n', 'bob,orange,-45\n', 'bob,lemon,66\n',
      'bob,appl,-56\n', 'bob,,88\n']

for line in difflib.unified_diff(s1, s2, fromfile='test1.csv',
                                 tofile='test2.csv'):
    print line,
This gives:
--- test1.csv
+++ test2.csv
@@ -1,4 +1,4 @@
-Adam,apple,3850
+Adam,apple,2850
bob,orange,-45
bob,lemon,66
bob,appl,-56
So you can clearly see which lines were changed between test1.csv and test2.csv.
To line up the columns, you must use string formatting, e.g. print "%-20s %-20s %-20s" % (row[0], row[1], row[2]).
To change the ? into any text you like, use str.replace, e.g. s.replace('?', 'any text I like').
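To make that concrete, here is a small sketch (the sample row and column widths are mine) of fixed-width columns and of relabeling the ? guide lines:

row = 'Adam,apple,3850'.split(',')
print "%-20s %-20s %-20s" % (row[0], row[1], row[2])
# Adam                 apple                3850

guide = '? ^\n'
print guide.replace('?', 'test1', 1),
# test1 ^  -- note the caret no longer lines up; the next answer deals with that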
Your problem has more to do with the CSV format, since difflib has no idea it's looking at columnar fields. What you need is to figure out into which field the guide is pointing, so that you can adjust it when printing the columns.
If your CSV files are simple, i.e. they don't contain any quoted fields with embedded commas or (shudder) newlines, you can just use split(',') to separate them into fields, and figure out where the guide points as follows:
def align(line, guideline):
    """
    Figure out which field the guide (^) points to, and the offset within it.
    E.g., if the guide points 3 chars into field 2, return (2, 3)
    """
    fields = line.split(',')
    guide = guideline.index('^')
    f = p = 0
    while p + len(fields[f]) < guide:
        p += len(fields[f]) + 1  # +1 for the comma
        f += 1
    offset = guide - p
    return f, offset
Now it's easy to show the guide properly. Let's say you want to align your columns by printing everything 12 spaces wide:
diff = difflib.ndiff(...)
for line in diff:
    code = line[0]  # the diff prefix
    print code,
    if code == '?':
        fld, offset = align(lastline, line[2:])
        for f in range(fld):
            print "%-12s" % '',
        print ' ' * offset + '^'
    else:
        fields = line[2:].rstrip('\r\n').split(',')
        for f in fields:
            print "%-12s" % f,
        print
    lastline = line[2:]
Be warned that the only reliable way to parse CSV files is to use the csv module (or a robust alternative); getting it to play well with the diff format in full generality would be a bit of a headache, though. If you're mainly interested in readability and your CSV isn't too gnarly, you can probably live with an occasional mix-up.
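As a minimal illustration of that point (the sample line is mine), the csv module will happily parse the body of a single diff line once you slice off the two-character prefix, and it handles quoted commas that split(',') would mangle:

import csv

line = '- Adam,"apple, red",3850\n'
fields = next(csv.reader([line[2:]]))
print fields  # ['Adam', 'apple, red', '3850'] -- the quoted comma survives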
I have a tab delimited text file with the following data:
ahi1
b/se
ahi
test -2.435953
1.218364
ahi2
b/se
ahi
test -2.001858
1.303935
I want to extract the two floating point numbers to a separate csv file with two columns, ie.
-2.435953 1.218364
-2.001858 1.303935
Currently my hack attempt is:
import csv
from itertools import islice
results = csv.reader(open('test', 'r'), delimiter="\n")
list(islice(results,3))
print results.next()
print results.next()
list(islice(results,3))
print results.next()
print results.next()
Which is not ideal. I am a Noob to Python so I apologise in advance and thank you for your time.
Here is the code to do the job:
import re

# this is the same data just copy/pasted from your question
data = """ ahi1
b/se
ahi
test -2.435953
1.218364
ahi2
b/se
ahi
test -2.001858
1.303935"""

# we're going to search through it line by line and parse out the
# numbers using regular expressions. the pattern first skips any run
# of characters that aren't digits or '-' ([^-\d], where ^ means NOT),
# then captures 0 or 1 dashes followed by digits, a dot, and digits
# again ([\-]{0,1}\d+\.\d+), and then skips trailing non-number chars.
pattern = re.compile(r"[^-\d]*([\-]{0,1}\d+\.\d+)[^-\d]*")

results = []
for line in data.split("\n"):
    match = pattern.match(line)
    if match:
        results.append(match.groups()[0])

pairs = []
i = 0
end = len(results)
while i < end - 1:
    pairs.append((results[i], results[i+1]))
    i += 2

for p in pairs:
    print "%s, %s" % (p[0], p[1])
The output:
>>>
-2.435953, 1.218364
-2.001858, 1.303935
Instead of printing out the numbers, you could save them in a list and zip them together afterwards.
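For example, a minimal sketch of that zip step, applied to the results list built above:

results = ['-2.435953', '1.218364', '-2.001858', '1.303935']
pairs = zip(results[::2], results[1::2])  # pair up consecutive values
for a, b in pairs:
    print "%s, %s" % (a, b)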
I'm using the Python regular expression framework to parse the text. I can only recommend picking up regular expressions if you don't already know them; I find them very useful for parsing text and all sorts of machine-generated output files.
EDIT:
Oh and BTW, if you're worried about the performance, I tested on my slow old 2ghz IBM T60 laptop and I can parse a megabyte in about 200ms using the regex.
UPDATE:
I felt kind, so I did the last step for you :P
Maybe this can help
zip(*[results]*5)
e.g.
import csv
from itertools import izip

results = csv.reader(open('test', 'r'), delimiter="\t")
for result1, result2 in (x[3:5] for x in izip(*[results]*5)):
    ...  # do something with the result
Tricky enough, but here is a more eloquent, sequential solution using shell tools:
$ grep -v "ahi" myFileName | grep -v se | tr -d "test\" " | awk 'NR%2{printf $0", ";next;}1'
-2.435953, 1.218364
-2.001858, 1.303935
How it works: basically, remove the specific unwanted text lines, then strip the remaining unwanted text within lines, then join every second line, with formatting. I just added the comma for beautification purposes; leave the comma out of awk's printf ", " if you don't need it.
Hey there, I have a rather large file that I want to process using Python, and I'm kind of stuck as to how to do it.
The format of my file is like this:
0 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
So I basically want to read in the chunk from 0 to 1, do my processing on it, then move on to the chunk between 1 and 2.
So far I've tried using a regex to match the number and then keep iterating, but I'm sure there has to be a better way of going about this. Any suggestion/info would be greatly appreciated.
If they are all within the same line, that is, if there are no line breaks between "1." and "2.", then you can iterate over the lines of the file like this:
for line in open("myfile.txt"):
#do stuff
The line will be disposed of and overwritten at each iteration meaning you can handle large file sizes with ease. If they're not on the same line:
for line in open("myfile.txt"):
if #regex to match start of new string
parsed_line = line
else:
parsed_line += line
and the rest of your code.
Why don't you just read the file char by char using file.read(1)?
Then, in each iteration, you could check whether you have arrived at the char '1'. You just have to make sure that accumulating the string stays fast.
If the "N " can only start a line, then why not use use the "simple" solution? (It sounds like this already being done, I am trying to reinforce/support it ;-))
That is, just reading a line at a time, and build up the data representing the current N object. After say N=0, and N=1 are loaded, process them together, then move onto the next pair (N=2, N=3). The only thing that is even remotely tricky is making sure not to throw out a read line. (The line read that determined the end condition -- e.g. "N " -- also contain the data for the next N).
Unless seeking is required (or IO caching is disabled or there is an absurd amount of data per item), there is really no reason not to use readline AFAIK.
Happy coding.
Here is some off-the-cuff code, which likely contains multiple errors. In any case, it shows the general idea using a minimized side-effect approach.
import re

# given an input and previous item data, return either
# [item_number, data, next_overflow] if another item is read
# or None if there are no more items
def read_item(inp, overflow):
    data = overflow or ""
    # this can be replaced with any method to "read the header";
    # the regex is just "the easiest". the contract is:
    # given "N ....", return N. given anything else, return None
    def get_num(d):
        m = re.match(r"(\d+) ", d)
        return int(m.group(1)) if m else None
    for line in inp:
        if data and get_num(line) is not None:
            # already in an item (have data); current line "overflows".
            # item number is still at start of current data
            return [get_num(data), data, line]
        # not in item, or new item not found yet
        data += line
    # at end of input, with data. only returns above
    # if a "new" item was encountered; this covers the case of
    # no more items (or no items at all)
    if data:
        return [get_num(data), data, None]
    else:
        return None
And usage might be akin to the following, where f represents an open file:
# check for error conditions (e.g. None returned)
# note feed-through of "overflow"
num1, data1, overflow = read_item(f, None)
num2, data2, overflow = read_item(f, overflow)
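Usage in a loop might look like this (a sketch under the same caveats as above, draining items until None comes back):

overflow = None
while True:
    item = read_item(f, overflow)
    if item is None:
        break  # no more items
    num, data, overflow = item
    # process num and data here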
If the format is fixed, why not just read 3 lines at a time with readline()
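A minimal sketch of that fixed-format idea (the file name is mine; it assumes exactly three lines per record):

with open("myfile.txt") as f:
    while True:
        record = [f.readline() for _ in range(3)]
        if not record[0]:
            break  # end of file
        # process the three-line record here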
If the file is small, you could read the whole file in and split() on the number digits (you might want to use strip() to get rid of whitespace and newlines), then fold over the resulting list to process each string in it. You'll probably have to check that the string you're about to process is not empty, in case two digits were next to each other.
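Here is a rough sketch of that whole-file approach using re.split on the leading digits (the header pattern is my assumption; note that the digit headers themselves are consumed by the split):

import re

with open("myfile.txt") as f:
    text = f.read()
# split on a digit-run header at the start of a line
chunks = re.split(r'^\d+ ', text, flags=re.MULTILINE)
chunks = [c.strip() for c in chunks if c.strip()]  # drop the empty first piece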
If the file's content can be loaded in memory, and you answered that it can, then the following code (which needs filename defined) may be a solution.
import re

regx = re.compile(r'^((\d+).*?)(?=^\d|\Z)', re.DOTALL | re.MULTILINE)

with open(filename) as f:
    text = f.read()

def treat(inp, regx=regx):
    m1 = regx.search(inp)
    numb, chunk = m1.group(2, 1)
    li = [chunk]
    for mat in regx.finditer(inp, m1.end()):
        n, ch = mat.group(2, 1)
        if int(n) == int(numb) + 1:
            yield ''.join(li)
            numb = n
            li = []
        li.append(ch)
        chunk = ch
    yield ''.join(li)

for y in treat(text):
    print repr(y)
This code, run on a file containing :
1 mountain
orange 2
apple
produce
2 gas
solemn
enlightment
protectorate
3 grimace
song
4 snow
wheat
51 guludururu
kelemekinonoto
52asabi dabada
5 yellow
6 pink
music
air
7 guitar
blank 8
8 Canada
9 Rimini
produces:
'1 mountain\norange 2\napple\nproduce\n'
'2 gas\nsolemn\nenlightment\nprotectorate\n'
'3 grimace\nsong\n'
'4 snow\nwheat\n51 guludururu\nkelemekinonoto\n52asabi dabada\n'
'5 yellow\n'
'6 pink \nmusic\nair\n'
'7 guitar\nblank 8\n'
'8 Canada\n'
'9 Rimini'