Python input/output more efficiently

I need to process over 10 million spectroscopic data sets. The data is structured like this: there are around 1000 .fits files (.fits is a data storage format), each file contains around 600-1000 spectra, and each spectrum has around 4500 elements (so each file yields a roughly 1000*4500 matrix). That means each spectrum will be read around 10 times (or each file will be opened around 10,000 times) if I loop over the 10 million entries. Although the same spectrum is read around 10 times, the work is not duplicated, because each time I extract a different segment of the same spectrum.
I have a catalog file which contains all the information I need, like the coordinates x, y, the radius r, the strength s, etc. The catalog also contains the information to target which file I am going to read (identified by n1, n2) and which spectrum in that file I am going to use (identified by n3).
The code I have now is:
import numpy as np
from itertools import izip
import fitsio

x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []

with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
    for line1, line2 in izip(file_ID, file_c):
        parts1 = line1.split()
        parts2 = line2.split()
        n1.append(parts1[0])
        n2.append(parts1[1])
        n3.append(float(parts1[2]))
        x.append(float(parts2[0]))
        y.append(float(parts2[1]))
        r.append(float(parts2[2]))
        s.append(float(parts2[3]))

def data_analysis(idx_start, idx_end):  #### loop over 10 million entries
    data_stru = np.zeros((idx_end-idx_start), dtype=[('spec','f4',(200)),('x','f8'),('y','f8'),('r','f8'),('s','f8')])
    for i in xrange(idx_start, idx_end):
        filename = "../../../data/" + str(n1[i]) + "/spPlate-" + str(n1[i]) + "-" + str(n2[i]) + ".fits"
        fits_spectra = fitsio.FITS(filename)
        fluxx = fits_spectra[0][n3[i]-1:n3[i], 0:4000]  #### returns a list of lists
        flux = fluxx[0]
        hdu = fits_spectra[0].read_header()
        wave_start = hdu['CRVAL1']
        logwave = wave_start + 0.0001 * np.arange(4000)
        wavegrid = np.power(10, logwave)
        ##### After I read the flux and the wavegrid, I can do my analysis.
        ##### save data to data_stru
        ##### Reading is the most time-consuming part of this code; my later analysis is not.
The problem is that the files are too big to load into memory all at once, and my catalog is not structured such that all entries which open the same file are grouped together. I wonder whether anyone can offer some thoughts on splitting the large loop into two loops: 1) an outer loop over the files, so that we avoid repeatedly opening/reading the same files again and again, and 2) an inner loop over the entries that use the same file.

If I understand your code correctly, n1 and n2 determine which file to open, so why not just lexsort them? You can then use itertools.groupby to group records with the same (n1, n2). Here is a down-scaled proof of concept:
import itertools
import numpy as np

n1 = np.random.randint(0, 3, (10,))
n2 = np.random.randint(0, 3, (10,))
mockdata = np.arange(10) + 100

s = np.lexsort((n2, n1))
for k, g in itertools.groupby(zip(s, n1[s], n2[s]), lambda x: x[1:]):
    # groupby groups the items i of its first argument (zip(...) here) by the
    # result of applying the optional second argument (the lambda) to i.
    # The lambda strips si from the tuple (si, n1si, n2si) that zip produces,
    # because otherwise equal (n1si, n2si) pairs would still be treated as
    # different due to their distinct si's, and no grouping would occur.
    # Putting si in the tuple in the first place is necessary so that we can
    # reference the other records of the corresponding row in the inner loop.
    print(k)
    for si, n1s, n2s in g:
        # si can be used to access the corresponding other records
        print(si, mockdata[si])
Prints something like:
(0, 1)
4 104
(0, 2)
0 100
2 102
6 106
(1, 0)
1 101
(2, 0)
8 108
9 109
(2, 1)
3 103
5 105
7 107
You may want to include n3 in the lexsort, but not in the grouping, so you can process each file's contents in order.
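Applied to the actual catalog, a sketch (not tested against your data) might look like the following. It reuses the question's n1/n2/n3 lists and file-name pattern, converts them to arrays for lexsort, and leaves the per-entry analysis as a placeholder:
import itertools
import numpy as np
import fitsio

n1a = np.array(n1)
n2a = np.array(n2)
n3a = np.array(n3, dtype=int)

order = np.lexsort((n3a, n2a, n1a))  # n3 included so spectra are read in order within a file
for (f1, f2), grp in itertools.groupby(order, key=lambda i: (n1a[i], n2a[i])):
    filename = "../../../data/" + str(f1) + "/spPlate-" + str(f1) + "-" + str(f2) + ".fits"
    fits_spectra = fitsio.FITS(filename)      # each file is now opened exactly once
    hdu = fits_spectra[0].read_header()
    for i in grp:                             # all catalog rows that need this file
        flux = fits_spectra[0][n3a[i] - 1:n3a[i], 0:4000][0]
        # ... per-entry analysis for catalog row i, using x[i], y[i], r[i], s[i] ...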


Improve performance of dataframe-like structure

I'm facing a data-structure challenge regarding a process in my code in which I need to count the frequency of strings in positive and negative examples.
It's a large bottleneck and I don't seem to be able to find a better solution.
I have to go through every long string in the dataset and extract substrings, whose frequency I need to count. In a perfect world, a pandas dataframe of the following shape would be perfect:
At the end, the expected structure is something like
string | frequency positive | frequency negative
________________________________________________
str1 | 5 | 7
str2 | 2 | 4
...
However, for obvious performance reasons, this is not acceptable.
My solution is to use a dictionary to track the rows, and an Nx2 numpy matrix to track the frequency. This is also done because after this step I need the frequencies in an Nx2 numpy matrix anyway.
Currently, my solution is something like this:
str_freq = np.zeros((N, 2), dtype=np.uint32)
str_dict = {}
str_dict_counter = 0

for i, string in enumerate(dataset):
    substrings = extract(string)  # substrings is a List[str]
    for substring in substrings:
        row = str_dict.get(substring, None)
        if row is None:
            str_dict[substring] = str_dict_counter
            row = str_dict_counter
            str_dict_counter += 1
        str_freq[row, target[i]] += 1  # target[i] is equal to 1 or 0
However, it is really the bottleneck of my code, and I'd like to speed it up.
Some things about this code are incompressible, for instance the extract(string) call, so that loop has to remain. However, there is no problem with using parallel processing if that helps.
What I'm wondering especially is whether there is a way to improve the inner loop. Python is known to be bad with loops, and this one seems a bit pointless; however, since we can't (to my knowledge) do multiple gets and sets on dictionaries the way we can with numpy arrays, I don't know how I could improve it.
What do you suggest doing? Is the only solution to re-write it in some lower-level language?
I also thought about using SQLite, but I don't know if it's worth it.
For the record, this has to take in about 10MB of data, it currently takes about 45 seconds, but needs to be done repeatedly with new data each time.
EDIT: Added example to test yourself
import random
import string
import re
import numpy as np
import pandas as pd

def get_random_alphaNumeric_string(stringLength=8):
    return bytes(bytearray(np.random.randint(0, 256, stringLength, dtype=np.uint8)))

def generate_dataset(n=10000):
    d = []
    for i in range(n):
        rnd_text = get_random_alphaNumeric_string(stringLength=1000)
        d.append(rnd_text)
    return d

def test_dict(dataset):
    pattern = re.compile(b"(q.{3})")
    target = np.random.randint(0, 2, len(dataset))
    str_freq = np.zeros((len(dataset)*len(dataset[0]), 2), dtype=np.uint32)
    str_dict = {}
    str_dict_counter = 0
    for i, string in enumerate(dataset):
        substrings = pattern.findall(string)  # substrings is a List[str]
        for substring in substrings:
            row = str_dict.get(substring, None)
            if row is None:
                str_dict[substring] = str_dict_counter
                row = str_dict_counter
                str_dict_counter += 1
            str_freq[row, target[i]] += 1  # target[i] is equal to 1 or 0
    return str_dict, str_freq[:str_dict_counter, :]

def test_df(dataset):
    pattern = re.compile(b"(q.{3})")
    target = np.random.randint(0, 2, len(dataset))
    df = pd.DataFrame(columns=["str", "pos", "neg"])
    df.astype(dtype={"str": bytes, "pos": int, "neg": int}, copy=False)
    df = df.set_index("str")
    for i, string in enumerate(dataset):
        substrings = pattern.findall(string)  # substrings is a List[str]
        for substring in substrings:
            check = substring in df.index
            if not check:
                row = [0, 0]
                row[target[i]] = 1
                df.loc[substring] = row
            else:
                df.loc[substring][target[i]] += 1
    return df

dataset = generate_dataset(1000000)

d, f = test_dict(dataset)  # takes ~10 seconds on my laptop
# to get the value of some key, say b'q123'
f[d[b'q123'], :]

d = test_df(dataset)  # takes several minutes (hasn't finished yet)
# the same but with a dataframe
d.loc[b'q123']
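One possible micro-optimization of the inner loop (a sketch, not benchmarked here): dict.setdefault collapses the get/insert pair into a single lookup, and binding the method and target[i] to locals avoids repeated lookups in the hot loop. Names mirror the question's test_dict():
import re
import numpy as np

def test_dict_setdefault(dataset):
    pattern = re.compile(b"(q.{3})")
    target = np.random.randint(0, 2, len(dataset))
    str_freq = np.zeros((len(dataset)*len(dataset[0]), 2), dtype=np.uint32)  # sized as in the question
    str_dict = {}
    setdefault = str_dict.setdefault  # bind the method once
    for i, string in enumerate(dataset):
        t = target[i]
        for substring in pattern.findall(string):
            row = setdefault(substring, len(str_dict))  # existing row, or a fresh index
            str_freq[row, t] += 1
    return str_dict, str_freq[:len(str_dict), :]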

I'm having trouble with my program running too long. I'm not sure if it's running infinitely or if it's just really slow

I'm writing code to analyze an (8477960, 1) column vector. I am not sure whether the while loops in my code are running infinitely, or whether the way I've written things is just really slow.
This is a section of my code up to the first while loop, which I cannot get to run to completion.
import numpy as np
import pandas as pd

data = pd.read_csv(r'C:\Users\willo\Desktop\TF_60nm_2_2.txt')

def recursive_low_pass(rawsignal, startcoeff, endcoeff, filtercoeff):
    # The current signal length
    ni = len(rawsignal)  # signal size
    rougheventlocations = np.zeros(shape=(100000, 3))
    # The algorithm parameters
    # filter coefficient
    a = filtercoeff
    raw = np.array(rawsignal).astype(np.float)
    # thresholds
    s = startcoeff
    e = endcoeff  # for event start and end thresholds
    # The recursive algorithm
    # loop init
    ml = np.zeros(ni)
    vl = np.zeros(ni)
    s = np.zeros(ni)
    ml[0] = np.mean(raw)  # local mean init
    vl[0] = np.var(raw)  # local variance init
    i = 0  # sample counter
    numberofevents = 0  # number of detected events
    # main loop
    while i < (ni - 1):
        i = i + 1
        # local mean low pass filtering
        ml[i] = a * ml[i - 1] + (1 - a) * raw[i]
        # local variance low pass filtering
        vl[i] = a * vl[i - 1] + (1 - a) * np.power([raw[i] - ml[i]], 2)
        # local threshold to detect event start
        sl = ml[i] - s * np.sqrt(vl[i])
I'm not getting any error messages, but I've let the program run for more than 10 minutes without any results, so I assume I'm doing something incorrectly.
You should try to vectorize this process rather than processing indexes one at a time (otherwise, why use numpy?).
The other thing is that you seem to be doing unnecessary work (unless we're not seeing the whole function).
the line:
sl = ml[i] - s * np.sqrt(vl[i])
assigns the variable sl, which you're not using inside the loop (or anywhere else). This assignment performs a whole-vector multiplication by s, which is all zeros. If you do need the sl variable, you should calculate it outside of the loop using the last encountered values of ml[i] and vl[i], which you can store in temporary variables instead of recomputing on every iteration.
If ni is in the millions, this unnecessary vector multiplication (of millions of zeros) is going to be very costly.
You probably didn't mean to overwrite s = startcoeff with s = np.zeros(ni) in the first place.
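For example, a sketch that computes it once after the loop, keeping the start threshold under a name that isn't clobbered by s = np.zeros(ni):
sl = ml[-1] - startcoeff * np.sqrt(vl[-1])  # threshold from the last filtered values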
In order to vectorize these calculations you would need an accumulate-style operation with customized functions.
The non-numpy equivalent would be as follows (using itertools instead):
from itertools import accumulate
ml = [np.mean(raw)]+[0]*(ni-1)
mlSums = accumulate(zip(ml,raw),lambda r,d:(a*r[0] + (1-a)*d[1],0))
ml = [v for v,_ in mlSums]
vl = [np.var(raw)]+[0]*(ni-1)
vlSums = accumulate(zip(vl,raw,ml),lambda r,d:(a*r[0] + (1-a)*(d[1]-d[2])**2,0,0))
vl = [v for v,_,_ in vlSums]
In each case, the ml / vl vectors are initialized with the base value at index zero and the rest filled with zeroes.
The accumulate(zip(... function calls go through the array and call the lambda function with the current sum in r and the paired elements in d. For the ml calculation, this corresponds to r = (ml[i-1],_) and d = (0,raw[i]).
Because accumulate outputs the same data type as it is given as input (zipped tuples here), the actual result is only the first value of the tuples in the mlSums/vlSums lists.
This took 9.7 seconds to process for 8,477,960 items in the lists.
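If you want to go further than trimming the loop: this recursion is a first-order IIR (exponential) filter, so scipy.signal.lfilter can compute the whole ml and vl arrays at once. A hedged sketch, reusing the question's raw, filtercoeff and startcoeff, and seeding the filter state so the first outputs match the loop's ml[0] and vl[0]:
import numpy as np
from scipy.signal import lfilter

a = filtercoeff  # same coefficient as in the question's loop

# ml[i] = a*ml[i-1] + (1-a)*raw[i]  is  y[n] = (1-a)*x[n] + a*y[n-1],
# i.e. an IIR filter with numerator [1 - a] and denominator [1, -a].
ml0 = np.mean(raw)
ml, _ = lfilter([1 - a], [1, -a], raw, zi=[ml0 - (1 - a) * raw[0]])

# same recursion for the variance, applied to the squared residuals
vl0 = np.var(raw)
resid2 = (raw - ml) ** 2
vl, _ = lfilter([1 - a], [1, -a], resid2, zi=[vl0 - (1 - a) * resid2[0]])

sl = ml - startcoeff * np.sqrt(vl)  # per-sample start threshold, computed in one shot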

How to improve Python iteration performance over large files

I have a reference file that is about 9,000 lines and has the following structure: (index, size) - where index is unique but size may not be.
0 193532
1 10508
2 13984
3 14296
4 12572
5 12652
6 13688
7 14256
8 230172
9 16076
And I have a data file that is about 650,000 lines and has the following structure: (cluster, offset, size) - where offset is unique but size is not.
446 0xdf6ad1 34572
447 0xdf8020 132484
451 0xe1871b 11044
451 0xe1b394 7404
451 0xe1d12b 5892
451 0xe1e99c 5692
452 0xe20092 6224
452 0xe21a4b 5428
452 0xe23029 5104
452 0xe2455e 138136
I need to compare each size value in the second column of the reference file for any matches with the size values in the third column of the data file. If there is a match, output the offset hex value (second column in the data file) with the index value (first column in the reference file). Currently I am doing this with the following code and just piping it to a new file:
#!/usr/bin/python3
import sys

ref_file = sys.argv[1]
dat_file = sys.argv[2]

with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:
    for r_line in ref:
        ref_size = r_line[r_line.find(' ') + 1:-1]
        for d_line in dat:
            dat_size = d_line[d_line.rfind(' ') + 1:-1]
            if dat_size == ref_size:
                print(d_line[d_line.find('0x') : d_line.rfind(' ')]
                      + '\t'
                      + r_line[:r_line.find(' ')])
        dat.seek(0)
The typical output looks like this:
0x86ece1eb 0
0x16ff4628f 0
0x59b358020 0
0x27dfa8cb4 1
0x6f98eb88f 1
0x102cb10d4 2
0x18e2450c8 2
0x1a7aeed12 2
0x6cbb89262 2
0x34c8ad5 3
0x1c25c33e5 3
This works fine but takes about 50 minutes to complete for the given file sizes.
It does its job, but as a novice I am always keen to learn ways to improve my coding and share these learnings. My question is: what changes could I make to improve the performance of this code?
You can do the following: build a dictionary dic (the following is pseudocode; I also assume sizes don't repeat):
for index, size in the first file:
    dic[size] = index

for index, offset, size in the second file:
    if size in dic.keys():
        print dic[size], offset
Since you look up lines in the files by size, these sizes should be the keys in whatever dictionary data structure you use. You need this dictionary to get rid of the nested loop, which is the real performance killer here. Furthermore, as your sizes are not unique, you will have to use lists of offset / index values (depending on which file you store in the dictionary). A defaultdict will help you avoid some clunky code:
from collections import defaultdict

with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:
    dat_dic = defaultdict(list)  # maintain a list of offsets for each size
    for d_line in dat:
        _, offset, size = d_line.split()
        dat_dic[size].append(offset)
    for r_line in ref:
        index, size = r_line.split()
        for offset in dat_dic[size]:
            # dict lookup is O(1), unlike the O(N) loop over the dat_file
            print('{offset}\t{index}'.format(offset=offset, index=index))
If the order of your output lines does not matter, you could think about doing it the other way around, because your dat_file is so much bigger and building the defaultdict from it uses a lot more RAM.
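For completeness, a sketch of that reversed variant (building the dictionary from the small reference file and streaming the large data file once; ref_file and dat_file are the question's file names):
from collections import defaultdict

ref_dic = defaultdict(list)  # size -> list of indices from the reference file
with open(ref_file, 'r') as ref:
    for r_line in ref:
        index, size = r_line.split()
        ref_dic[size].append(index)

with open(dat_file, 'r') as dat:
    for d_line in dat:
        _, offset, size = d_line.split()
        for index in ref_dic.get(size, ()):
            print('{offset}\t{index}'.format(offset=offset, index=index))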

Beyond for-looping: high performance parsing of a large, well formatted data file

I am looking to optimize the performance of a big data parsing problem I have using python. In case anyone is interested: the data shown below is segments of whole genome DNA sequence alignments for six primate species.
Currently, the best way I know how to proceed with this type of problem is to open each of my ~250 (size 20-50MB) files, loop through line by line and extract the data I want. The formatting (shown in examples) is fairly regular although there are important changes at each 10-100 thousand line segment. Looping works fine but it is slow.
I have been using numpy recently for processing massive (>10 GB) numerical data sets and I am really impressed at how quickly I am able to perform different computations on arrays. I wonder if there are some high-powered solutions for processing formatted text that circumvents tedious for-looping?
My files contain multiple segments with the pattern
<MULTI-LINE HEADER> # number of header lines mirrors number of data columns
<DATA BEGIN FLAG> # the word 'DATA'
<DATA COLUMNS> # variable number of columns
<DATA END FLAG> # the pattern '//'
<EMPTY LINE>
Example:
# key to the header fields:
# header_flag chromosome segment_start segment_end quality_flag chromosome_data
SEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)
SEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)
SEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)
DATA
GGGGGG
CCCCTC
...... # continue for 10-100 thousand lines
//
SEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11562256 11579393 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 219047970 219064053 -1 (chr_length=229942017)
DATA
CCCC
GGGG
.... # continue for 10-100 thousand lines
//
<ETC>
I will use segments where the species homo_sapiens and macaca_mulatta are both present in the header, and field 6, which I called the quality flag in the comments above, equals '1' for each species. Since macaca_mulatta does not appear in the second example, I would ignore this segment completely.
I care about segment_start and segment_end coordinates for homo_sapiens only, so in segments where homo_sapiens is present, I will record these fields and use them as keys to a dict(). segment_start also tells me the first positional coordinate for homo_sapiens, which increases strictly by 1 for each line of data in the current segment.
I want to compare the letters (DNA bases) for homo_sapiens and macaca_mulatta. The header line where homo_sapiens and macaca_mulatta appear (i.e. 1 and 5 in the first example) correspond to the column of data representing their respective sequences.
Importantly, these columns are not always the same, so I need to check the header to get the correct indices for each segment, and to check that both species are even in the current segment.
Looking at the two lines of data in example 1, the relevant information for me is
# homo_sapiens_coordinate homo_sapiens_base macaca_mulatta_base
11388669 G G
11388670 C T
For each segment containing info for homo_sapiens and macaca_mulatta, I will record start and end for homo_sapiens from the header and each position where the two DO NOT match into a list. Finally, some positions have "gaps" or lower quality data, i.e.
aaa--A
I will only record from positions where homo_sapiens and macaca_mulatta both have valid bases (must be in the set ACGT) so the last variable I consider is a counter of valid bases per segment.
My final data structure for a given file is a dictionary which looks like this:
{(segment_start=i, segment_end=j, valid_bases=N): list(mismatch positions),
(segment_start=k, segment_end=l, valid_bases=M): list(mismatch positions), ...}
Here is the function I have written to carry this out using a for-loop:
def human_macaque_divergence(chromosome):
    """
    A function for finding the positions of human-macaque divergent sites within segments of species alignment tracts
    :param chromosome: chromosome (integer)
    :return div_dict: a dictionary with tuple(segment_start, segment_end, valid_bases_in_segment) for keys and list(divergent_sites) for values
    """
    ch = str(chromosome)
    div_dict = {}
    with gz.open('{al}Compara.6_primates_EPO.chr{c}_1.emf.gz'.format(al=pd.align, c=ch), 'rb') as f:
        # key to the header fields:
        # header_flag chromosome segment_start segment_end quality_flag chromosome_info
        # SEQ homo_sapiens 1 14163 24841 1 (chr_length=249250621)
        # flags, containers, counters and indices:
        species = []
        starts = []
        ends = []
        mismatch = []
        valid = 0
        pos = -1
        hom = None
        mac = None
        species_data = False  # a flag signalling that the lines we are viewing are alignment columns
        for line in f:
            if 'SEQ' in line:  # 'SEQ' signifies a segment info field
                assert species_data is False
                line = line.split()
                if line[2] == ch and line[5] == '1':  # make sure the alignment is on the desired chromosome and quality_flag is '1'
                    species += [line[1]]  # collect each species in the header
                    starts += [int(line[3])]  # collect starts and ends
                    ends += [int(line[4])]
            if 'DATA' in line and {'homo_sapiens', 'macaca_mulatta'}.issubset(species):
                species_data = True
                # get the indices to scan in data columns:
                hom = species.index('homo_sapiens')
                mac = species.index('macaca_mulatta')
                pos = starts[hom]  # first homo_sapiens positional coordinate
                continue
            if species_data and '//' not in line:
                assert pos > 0
                # record the relevant bases:
                human = line[hom]
                macaque = line[mac]
                if {human, macaque}.issubset(bases):
                    valid += 1
                if human != macaque and {human, macaque}.issubset(bases):
                    mismatch += [pos]
                pos += 1
            elif species_data and '//' in line:  # '//' signifies segment boundary
                # store segment results if a boundary has been reached and data has been collected for the last segment:
                div_dict[(starts[hom], ends[hom], valid)] = mismatch
                # reset flags, containers, counters and indices
                species = []
                starts = []
                ends = []
                mismatch = []
                valid = 0
                pos = -1
                hom = None
                mac = None
                species_data = False
            elif not species_data and '//' in line:
                # reset flags, containers, counters and indices
                species = []
                starts = []
                ends = []
                pos = -1
                hom = None
                mac = None
    return div_dict
This code works fine (perhaps it could use some tweaking), but my real question is whether or not there might be a faster way to pull this data without running the for-loop and examining each line? For example, loading the whole file using f.read() takes less than a second although it creates a pretty complicated string. (In principle, I assume that I could use regular expressions to parse at least some of the data, such as the header info, but I'm not sure if this would necessarily increase performance without some bulk method to process each data column in each segment).
Does anyone have any suggestions as to how I circumvent looping through billions of lines and parse this kind of text file in a more bulk manner?
Please let me know if anything is unclear in comments, happy to edit or respond directly to improve the post!
Yes, you could use some regular expressions to extract the data in one go; this is probably the best effort/performance ratio.
If you need more performances, you could use mx.TextTools to build a finite state machine; I'm pretty confident this will be significantly faster, but the effort needed to write the rules and the learning curve might discourage you.
You also could split the data in chunks and parallelize the processing, this could help.
When you have working code and need to improve performance, use a profiler and measure the effect of one optimization at a time. (Even if you don't use the profiler, definitely do the latter.) Your present code looks reasonable, that is, I don't see anything "stupid" in it in terms of performance.
Having said that, it is likely to be worthwhile to use precompiled regular expressions for all string matching. By using re.MULTILINE, you can read in an entire file as a string and pull out parts of lines. For example:
import re

s = open('file.txt').read()
p = re.compile(r'^SEQ\s+(\w+)\s+(\d+)\s+(\d+)\s+(\d+)', re.MULTILINE)
p.findall(s)
produces:
[('homo_sapiens', '1', '11388669', '11532963'),
('pan_troglodytes', '1', '11517444', '11668750'),
('gorilla_gorilla', '1', '11607412', '11751006'),
('pongo_pygmaeus', '1', '218866021', '219020464'),
('macaca_mulatta', '1', '14425463', '14569832'),
('callithrix_jacchus', '7', '45949850', '46115230')]
You will then need to post-process this data to deal with the specific conditions in your code, but the overall result may be faster.
Your code looks good, but there are particular things that could be improved, such as the use of map, etc.
For good guide on performance tips in Python see:
https://wiki.python.org/moin/PythonSpeed/PerformanceTips
I have used the above tips to get code working nearly as fast as C code. Basically, try to avoid for loops (use map), and try to use built-in functions such as find. Make Python work for you as much as possible by using its builtin functions, which are largely written in C.
Once you get acceptable performance you can run in parallel using:
https://docs.python.org/dev/library/multiprocessing.html#module-multiprocessing
Edit:
I also just realized you are opening a compressed gzip file. I suspect a significant amount of time is spent decompressing it. You can try to make this faster by multi-threading it with:
https://code.google.com/p/threadzip/
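For instance, a hedged sketch of the multiprocessing idea, assuming one compressed alignment file per chromosome and reusing the question's human_macaque_divergence function (the exact chromosome list is an assumption):
from multiprocessing import Pool

if __name__ == '__main__':
    chromosomes = range(1, 23)   # assumed: autosomes 1-22, one file each
    pool = Pool(processes=4)     # tune to the number of available cores
    results = pool.map(human_macaque_divergence, chromosomes)
    pool.close()
    pool.join()
    div_by_chromosome = dict(zip(chromosomes, results))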
You can combine re with some fancy zipping in list comprehensions that can replace the for loops and try to squeeze some performance gains. Below I outline a strategy for segmenting the data file read in as an entire string:
import re
from itertools import izip #(if you are using py2x like me, otherwise just use zip for py3x)
s = open('test.txt').read()
Now find all header lines, and the corresponding index ranges in the large string
head_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
head = [ h[0] for h in head_info]
head_inds = [ (h[1],h[2]) for h in head_info]
#head
#['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)',
# '\nSEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)',
#...
#head_inds
#[(107, 169),
# (169, 234),
# (234, 299),
# (299, 366),
# (366, 430),
# (430, 498),
# (1035, 1097),
# (1097, 1162)
# ...
Now, do the same for the data (lines of code with bases)
data_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\n[AGCT-]+.*', s)]
data = [ d[0] for d in data_info]
data_inds = [ (d[1],d[2]) for d in data_info]
Now, whenever there is a new segment, there will be a discontinuity between head_inds[i][1] and head_inds[i+1][0]. Same for data_inds. We can use this knowledge to find the beginning and end of each segment as follows
head_seg_pos = [ idx+1 for idx,(i,j) in enumerate( izip( head_inds[:-1], head_inds[1:])) if j[0]-i[1]]
head_seg_pos = [0] + head_seg_pos + [len(head_inds)] # add beginning and end which we will use next
head_segmented = [ head[s1:s2] for s1,s2 in izip( head_seg_pos[:-1], head_seg_pos[1:]) ]
#[['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)'],
#['\nSEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)',
# ...
and the same for the data
data_seg_pos = [ idx+1 for idx,(i,j) in enumerate( izip( data_inds[:-1], data_inds[1:])) if j[0]-i[1]]
data_seg_pos = [0] + data_seg_pos + [len(data_inds)] # add beginning and end for the next step
data_segmented = [ data[s1:s2] for s1,s2 in izip( data_seg_pos[:-1], data_seg_pos[1:]) ]
Now we can group the segmented data and segmented headers, and only keep groups with data on homo_sapiens and macaca_mulatta
groups = [ [h,d] for h,d in izip( head_segmented, data_segmented) if all( [sp in ''.join(h) for sp in ('homo_sapiens','macaca_mulatta')] ) ]
Now you have a groups array, where each group has
groups[0][0] #headers for segment 0
#['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)']
groups[0][1] # data from segment 0
#['\nGGGGGG',
# '\nCCCCTC',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# ...
The next step in the processing I will leave up to you, so I don't steal all the fun. But hopefully this gives you a good idea on using list comprehension to optimize code.
Update
Consider the simple test case to gauge efficiency of the comprehensions combined with re:
def test1():
    with open('test.txt','r') as f:
        head = []
        for line in f:
            if line.startswith('SEQ'):
                head.append( line)
    return head

def test2():
    s = open('test.txt').read()
    head = re.findall( '\nSEQ.*', s)
    return head
%timeit( test1() )
10000 loops, best of 3: 78 µs per loop
%timeit( test2() )
10000 loops, best of 3: 37.1 µs per loop
Even if I gather additional information using re
def test3():
    s = open('test.txt').read()
    head_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
    head = [ h[0] for h in head_info]
    head_inds = [ (h[1],h[2]) for h in head_info]
%timeit( test3() )
10000 loops, best of 3: 50.6 µs per loop
I still get speed gains. I believe using list comprehensions may be faster in your case. However, the for loop might actually beat the comprehension in the end (I take back what I said before); consider:
def test1(): # similar to how you are reading in the data in your for loop above
    with open('test.txt','r') as f:
        head = []
        data = []
        species = []
        species_data = False
        for line in f:
            if line.startswith('SEQ'):
                head.append( line)
                species.append( line.split()[1] )
                continue
            if 'DATA' in line and {'homo_sapiens', 'macaca_mulatta'}.issubset(species):
                species_data = True
                continue
            if species_data and '//' not in line:
                data.append( line )
                continue
            if species_data and line.startswith( '//' ):
                species_data = False
                species = []
                continue
    return head, data

def test3():
    s = open('test.txt').read()
    head_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
    head = [ h[0] for h in head_info]
    head_inds = [ (h[1],h[2]) for h in head_info]
    data_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\n[AGCT-]+.*', s)]
    data = [ h[0] for h in data_info]
    data_inds = [ (h[1],h[2]) for h in data_info]
    return head, data
In this case, as the iterations become more complex, the traditional for loop wins
In [24]: %timeit(test1())
10000 loops, best of 3: 135 µs per loop
In [25]: %timeit(test3())
1000 loops, best of 3: 256 µs per loop
Though I can still use re.findall twice and beat the for loop:
def test4():
    s = open('test.txt').read()
    head = re.findall( '\nSEQ.*', s )
    data = re.findall( '\n[AGTC-]+.*', s)
    return head, data
In [37]: %timeit( test4() )
10000 loops, best of 3: 79.5 µs per loop
I guess as the processing of each iteration becomes increasingly complex, the for loop will win, though there might be a more clever way to continue on with re. I wish there was a standard way to determine when to use either.
File processing with Numpy
The data itself appears to be completely regular and can be processed easily with Numpy. The header is only a tiny part of the file, and its processing speed is not very relevant. So the idea is to switch to Numpy for only the raw data and, other than that, keep the existing loops in place.
This approach works best if the number of lines in a data segment can be determined from the header. For the remainder of this answer I assume this is indeed the case. If this is not possible, the starting and ending points of data segments have to be determined with e.g. str.find or regex. This will still run at "compiled C speed" but the downside being that the file has to be looped over twice. In my opinion, if your files are only 50MB it's not a big problem to load a complete file into RAM.
E.g. put something like the following under if species_data and '//' not in line:
# Define `import numpy as np` at the top
# Determine number of rows from header data. This may need some
# tuning, if possible at all
nrows = max(ends[i]-starts[i] for i in range(len(species)))
# Sniff line length, because number of whitespace characters uncertain
fp = f.tell()
ncols = len(f.readline())
f.seek(fp)
# Load the data without loops. The file.read method can do the same,
# but with numpy.fromfile we have it in an array from the start.
data = np.fromfile(f, dtype='S1', count=nrows*ncols)
data = data.reshape(nrows, ncols)
# Process the data without Python loops. Here we leverage Numpy
# to really speed up the processing.
human = data[:,hom]
macaque = data[:,mac]
valid = np.in1d(human, bases) & np.in1d(macaque, bases)
mismatch = (human != macaque)
pos = starts[hom] + np.flatnonzero(valid & mismatch)
# Store
div_dict[(starts[hom], ends[hom], valid.sum())] = pos
# After using np.fromfile above, the file pointer _should_ be exactly
# in front of the segment termination flag
assert('//' in f.readline())
# Reset the header containers and flags
...
So the elif species_data and '//' in line: case has become redundant and the containers and flags can be reset in the same block as the above. Alternatively, you could also remove the assert('//' in f.readline()) and keep the elif species_data and '//' in line: case and reset containers and flags there.
Caveats
For relying on the file pointer to switch between processing the header and the data, there is one caveat: (in CPython) iterating a file object uses a read-ahead buffer causing the file pointer to be further down the file than you'd expect. When you would then use numpy.fromfile with that file pointer, it skips over data at the start of the segment and moreover it reads into the header of the next segment. This can be fixed by exclusively using the file.readline method. We can conveniently use it as an iterator like so:
for line in iter(f.readline, ''):
...
For determining the number of bytes to read with numpy.fromfile there is another caveat: Sometimes there is a single line termination character \n at the end of a line and other times two \r\n. The first is the convention on Linux/OSX and the latter on Windows. There is os.linesep to determine the default, but obviously for file parsing this isn't robust enough. So in the code above the length of a data line is determined by actually reading a line, checking the len, and putting back the file pointer to the start of the line.
When you encounter a data segment ('DATA' in line) and the desired species are not in it, you should be able to calculate an offset and f.seek(f.tell() + offset) to the header of the next segment. Much better than looping over data you're not even interested in!
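A hedged sketch of that skip, using nrows and ncols as determined in the snippet above and assuming every data line in the segment has the same byte length:
# inside the header-handling code, when the segment lacks the species of interest:
if 'DATA' in line and not {'homo_sapiens', 'macaca_mulatta'}.issubset(species):
    f.seek(f.tell() + nrows * ncols)   # jump past the whole data block
    assert '//' in f.readline()        # we should now be at the segment terminator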

List index out of range

So I am calling some data through SQL queries, and I am running into a list-index-out-of-range error when attempting to loop through it, normalize it and plot it.
Here are my SQL queries:
s1 = db.dict.execute("SELECT sp.wavelength, sp.flux, spt.spectral_type, s.designation, s.shortname, s.names FROM spectra AS sp JOIN sources AS s ON sp.source_id=s.id JOIN spectral_types as spt ON spt.source_id=s.id WHERE sp.instrument_id=6 AND sp.mode_id=2 AND 10<spt.spectral_type<19").fetchall()
s2 = db.dict.execute("SELECT sp.wavelength, sp.flux, spt.spectral_type, s.designation, s.shortname, s.names FROM spectra AS sp JOIN sources AS s ON sp.source_id=s.id JOIN spectral_types as spt ON spt.source_id=s.id WHERE sp.instrument_id=9 AND sp.wavelength_order='n3' AND 10<spt.spectral_type<19").fetchall()
s3 = db.dict.execute("SELECT sp.wavelength, sp.flux, spt.spectral_type, s.designation, s.shortname, s.names FROM spectra AS sp JOIN sources AS s ON sp.source_id=s.id JOIN spectral_types as spt ON spt.source_id=s.id WHERE sp.instrument_id=16 AND 10<spt.spectral_type<19").fetchall()
Then I am combining them into S:
S = s1+s2+s3
Finally, I want to loop through them to normalize, trim and plot all the entries in the dictionary I called.
for n, i in enumerate(S):
    W, F = S[n]['wavelength'], S[n]['flux']  # here I define wavelength and flux from SQL query "S"
    band = [1.15, 1.325]  # the range over which I want to normalize, see next line
    S1 = at.norm_spec([W, F], band)  # here I normalize W and F and define S1 as the normalized W,F
    W3 = S1[n][0]  # here I define a new W and F but from the normalized S1 spectrum file
    F3 = S1[n][1]
    W2, F2 = [p[np.where(np.logical_and(W3 > 1.15, W3 < 1.325))] for p in [W3, F3]]  # here I trim the noisy ends of the data and narrow to a small range
    z = np.polyfit(W2, F2, 3)  # from here on it's just fitting polynomials and plotting
    f = np.poly1d(z)
    yvalue = 6*(max(F2)/9)
    xvalue = 6*(max(W2)/9)
    W_new = np.linspace(W2[0], W2[-1], 5000)
    F_new = f(W_new)
    plt.ylabel('Flux F$_{\lambda}$')
    plt.xlabel('Wavelength ($\mu$m)')
    name = S[n]['designation']
    name2 = S[n]['shortname']
    name3 = S[n]['names']
    plt.annotate('{0} \n Spectral type: {1}'.format(S[n][('designation' or 'shortname' or 'names')], S[n]['spectral_type']), xy=(xvalue, yvalue), xytext=(xvalue, yvalue), color='black')
    #plt.figure()
    plt.plot(W2, F2, 'k-', W_new, F_new, 'g-')
Now, it goes through the first iteration, meaning it plots S1[0][0] and S1[0][1], but then it breaks and says S1[1][0] and S1[1][1] are out of range:
61 print len(S1)
62 #print S1[n]['wavelength']
--->63 W3 = S1[n][0] #here I define a new W and F but from the normalized S1 spectrum file
64 F3 = S1[n][1]
65 W2,F2 =[p[np.where(np.logical_and(W3>1.15, W3<1.325))] for p in [W3,F3]] #here I trim the noisy ends and narrow to potassium lines
IndexError: list index out of range
I really don't see where my error is in this, any help will be appreciated!
Sara
If I understood your code correctly, you're iterating over n objects of S, but then S1 is a whole new object whose length isn't n but something else (the length of that vector, or similar). Maybe just W3 = S1[0] would be correct? Though I don't know what at.norm_spec is or what type of object it returns, so I'm only guessing.
Suggestion: use (more) meaningful variable names in your code. It's really, really hard to read something like that, and it's very, very easy to make a mistake writing that kind of code. And it's not pythonic.
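For illustration, a guess at the fix (assuming at.norm_spec returns a single normalized [wavelength, flux] pair for the spectrum it was given, rather than one entry per catalog row):
S1 = at.norm_spec([W, F], band)
W3, F3 = S1[0], S1[1]  # index the returned pair directly, not with the outer loop counter n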
