How to improve Python iteration performance over large files

How to improve Python iteration performance over large files - python

I have a reference file that is about 9,000 lines and has the following structure: (index, size) - where index is unique but size may not be.
0 193532
1 10508
2 13984
3 14296
4 12572
5 12652
6 13688
7 14256
8 230172
9 16076
And I have a data file that is about 650,000 lines and has the following structure: (cluster, offset, size) - where offset is unique but size is not.
446 0xdf6ad1 34572
447 0xdf8020 132484
451 0xe1871b 11044
451 0xe1b394 7404
451 0xe1d12b 5892
451 0xe1e99c 5692
452 0xe20092 6224
452 0xe21a4b 5428
452 0xe23029 5104
452 0xe2455e 138136
I need to compare each size value in the second column of the reference file for any matches with the size values in the third column of the data file. If there is a match, output the offset hex value (second column in the data file) with the index value (first column in the reference file). Currently I am doing this with the following code and just piping it to a new file:
#!/usr/bin/python3
import sys
ref_file = sys.argv[1]
dat_file = sys.argv[2]
with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:
for r_line in ref:
ref_size = r_line[r_line.find(' ') + 1:-1]
for d_line in dat:
dat_size = d_line[d_line.rfind(' ') + 1:-1]
if dat_size == ref_size:
print(d_line[d_line.find('0x') : d_line.rfind(' ')]
+ '\t'
+ r_line[:r_line.find(' ')])
dat.seek(0)
The typical output looks like this:
0x86ece1eb 0
0x16ff4628f 0
0x59b358020 0
0x27dfa8cb4 1
0x6f98eb88f 1
0x102cb10d4 2
0x18e2450c8 2
0x1a7aeed12 2
0x6cbb89262 2
0x34c8ad5 3
0x1c25c33e5 3
This works fine but takes about 50mins to complete for the given file sizes.
It has done it's job, but as a novice I am always keen to learn ways to improve my coding and share these learnings. My question is, what changes could I make to improve the performance of this code?

You can do the following, take a dictionary dic and do the following ( following is a pseudocode, also I assume sizes don't repeat )
for index,size in the first file:
dic[size] = index
for index,offset,size in second file:
if size in dic.keys():
print dic[size],offset

Since you look up lines in the files by size, these sizes should be the keys in any dictionary data structure. This dictionary you will need to get rid of the nested loop which is the real performance killer here. Furthermore, as your sizes are not unique, you will have to use lists of offset / index values (depending on which file you want store in the dictionary). A defaultdict will help you avoid some clunky code:
from collections import defaultdict
with open(ref_file, 'r') as ref, open(dat_file, 'r') as dat:
dat_dic = defaultdict(list) # maintain a list of offsets for each size
for d_line in dat:
_, offset, size = d_line.split()
dat_dic[size].append(offset)
for r_line in ref:
index, size = r_line.split()
for offset in dat_dic[size]:
# dict lookup is O(1) and not O(N) ...
# ... as looping over the dat_file is
print('{offset}\t{index}'.format(offset=offset, index=index))
If the order of your output lines does not matter you can think about doing it the other way around because your dat_file is so much bigger and thus building the defaultdict from it uses a lot more RAM.

Related

How to change unit of list that was read from a text file?

I'm trying to change the unit of a value in a text file. First part of the task was to turn the string list to a float, but now (without using index()) I want to change the unit of the elemets which are kbps to mbps, so the 1200 value to 1.2.
This is the code that turns the values of the list to float:
bw = [] #another comment: create an empty list
with open("task4.txt") as file_name:
for line in file_name:
a = line.split() #The split() method splits a string into a list - whitspace as a default
bw.append(a[0]) #only appending the first value
floats = [float(line) for line in bw] #Turnes the string into a float
print(bw)
The text file looks like this:
7 Mbps
1200 Kbps
15 Mbps
32 Mbps
I need the list to become 7, 1.2, 15, 32 without changing the textfile, nor use index. I want a programs that finds all kbps values and turns them into mbps

You have to check the first letter of the units to determine whether to divide by 1000 to convert Kpbs to Mbps.
bw = [] #another comment: create an empty list
with open("task4.txt") as file_name:
for line in file_name:
speed, unit = line.split()
speed = float(speed)
if unit.startswith('K'):
speed /= 1000
bw.append(speed)
print(bw)

If you want both the Mbps and Kbps in the list you could try my code below:
with open('task4.txt', 'r') as file:
sizes = []
for line in file:
size, unit = line.strip().split()
if unit == 'Mbps':
sizes.append(float(size))
else:
sizes.append(float(size)/1000)
print(sizes)

To start, you can open and parse the file into a list by line:
speeds = []
with open("task4.txt", "r", encoding="utf-8") as file:
speeds = file.read().splitlines()
After reading the file, the task becomes a one-liner.
speeds = [str(int(x.split()[0]) / 1000) + " " + x.split()[1] if "kbps" in x.lower() else x for x in speeds]
For readability's sake, I have included a more verbose solution with comments here, although it does the exact same thing as the above:
# Create a temporary array to hold the converted speeds
new_speeds = []
# For each speed in the list
for speed in speeds:
# If the speed is in Kbps
if "kbps" in speed.lower():
# Fetch the raw speed integer and the unit name from the string
n_kbps, unit = speed.split()
# Convert the raw speed integer to Mbps
n_kbps /= 100
# Add the converted speed to the new_speeds array
new_speeds.append(f"{n_kbps} {unit}")
# If the speed is not in Kbps
else:
# Simply add the existing speed to the new_speeds array
new_speeds.append(speed)
# Set the original speeds array to the resulting array
speeds = new_speeds

Python input/output more efficiently

I need to process over 10 million spectroscopic data sets. The data is structured like this: there are around 1000 .fits (.fits is some data storage format) files, each file contains around 600-1000 spectra in which there are around 4500 elements in each spectra (so each file returns a ~1000*4500 matrix). That means each spectra is going to be repeatedly read around 10 times (or each file is going to be repeatedly read around 10,000 times) if I am going to loop over the 10 million entries. Although the same spectra is repeatedly read around 10 times, it is not duplicate because each time I extract different segments of the same spectra.
I have a catalog file which contains all the information I need, like the coordinates x, y, the radius r, the strength s, etc. The catalog also contains the information to target which file I am going to read (identified by n1, n2) and which spectra in that file I am going to use (identified by n3).
The code I have now is:
import numpy as np
from itertools import izip
import fitsio
x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []
with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
for line1, line2 in izip(file_ID,file_c):
parts1 = line1.split()
parts2 = line2.split()
n1.append(parts1[0])
n2.append(parts1[1])
n3.append(float(parts1[2]))
x.append(float(parts2[0]))
y.append(float(parts2[1]))
r.append(float(parts2[2]))
s.append(float(parts2[3]))
def data_analysis(idx_start,idx_end): #### loop over 10 million entries
data_stru = np.zeros((idx_end-idx_start), dtype=[('spec','f4',(200)),('x','f8'),('y','f8'),('r','f8'),('s','f8')])
for i in xrange(idx_start,idx_end)
filename = "../../../data/" + str(n1[i]) + "/spPlate-" + str(n1[i]) + "-" + str(n2[i]) + ".fits"
fits_spectra = fitsio.FITS(filename)
fluxx = fits_spectra[0][n3[i]-1:n3[i],0:4000] #### return a list of list
flux = fluxx[0]
hdu = fits_spectra[0].read_header()
wave_start = hdu['CRVAL1']
logwave = wave_start + 0.0001 * np.arange(4000)
wavegrid = np.power(10,logwave)
##### After I read the flux and the wavegrid, then I can do my following analysis.
##### save data to data_stru
##### Reading is the most time-consuming part of this code, my later analysis is not time consuming.
The problem is that the files are too big, there is no enough memory to load it at once, and my catalog is not structured such that all entries which will open the same file are grouped together. I wonder is there anyone who can offer some thoughts to split the large loop into two loops: 1) first loop over the files so that we can avoid repeatedly opening/reading files again and again, 2) loop over the entries which are going to use the same file.

If I understand your code correctly, n1 and n2 determine which file to open. So why do you not just lexsort them. You can then use itertools.groupby to group records with the same n1, n2. Here is a down-scaled proof of concept:
import itertools
n1 = np.random.randint(0, 3, (10,))
n2 = np.random.randint(0, 3, (10,))
mockdata = np.arange(10)+100
s = np.lexsort((n2, n1))
for k, g in itertools.groupby(zip(s, n1[s], n2[s]), lambda x: x[1:]):
# groupby groups the iterations i of its first argument
# (zip(...) in this case) by the result of applying the
# optional second argument (here lambda) to i.
# Here we use the lambda expression to remove si from the
# tuple (si, n1si, n2si) that zip produces because otherwise
# equal (n1si, n2si) pairs would still be treated as different
# because of the distinct si's. Hence no grouping would occur.
# Putting si in there in the first place is necessary, so we
# we can reference the other records of the corresponding row
# in the inner loop.
print(k)
for si, n1s, ns2 in g:
# si can be used to access the corresponding other records
print (si, mockdata[si])
Prints something like:
(0, 1)
4 104
(0, 2)
0 100
2 102
6 106
(1, 0)
1 101
(2, 0)
8 108
9 109
(2, 1)
3 103
5 105
7 107
You may want to include n3 in the lexsort, but not the grouping so you can process the files' content in order.

Beyond for-looping: high performance parsing of a large, well formatted data file

I am looking to optimize the performance of a big data parsing problem I have using python. In case anyone is interested: the data shown below is segments of whole genome DNA sequence alignments for six primate species.
Currently, the best way I know how to proceed with this type of problem is to open each of my ~250 (size 20-50MB) files, loop through line by line and extract the data I want. The formatting (shown in examples) is fairly regular although there are important changes at each 10-100 thousand line segment. Looping works fine but it is slow.
I have been using numpy recently for processing massive (>10 GB) numerical data sets and I am really impressed at how quickly I am able to perform different computations on arrays. I wonder if there are some high-powered solutions for processing formatted text that circumvents tedious for-looping?
My files contain multiple segments with the pattern
<MULTI-LINE HEADER> # number of header lines mirrors number of data columns
<DATA BEGIN FLAG> # the word 'DATA'
<DATA COLUMNS> # variable number of columns
<DATA END FLAG> # the pattern '//'
<EMPTY LINE>
Example:
# key to the header fields:
# header_flag chromosome segment_start segment_end quality_flag chromosome_data
SEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)
SEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)
SEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)
DATA
GGGGGG
CCCCTC
...... # continue for 10-100 thousand lines
//
SEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11562256 11579393 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 219047970 219064053 -1 (chr_length=229942017)
DATA
CCCC
GGGG
.... # continue for 10-100 thousand lines
//
<ETC>
I will use segments where the species homo_sapiens and macaca_mulatta are both present in the header, and field 6, which I called the quality flag in the comments above, equals '1' for each species. Since macaca_mulatta does not appear in the second example, I would ignore this segment completely.
I care about segment_start and segment_end coordinates for homo_sapiens only, so in segments where homo_sapiens is present, I will record these fields and use them as keys to a dict(). segment_start also tells me the first positional coordinate for homo_sapiens, which increases strictly by 1 for each line of data in the current segment.
I want to compare the letters (DNA bases) for homo_sapiens and macaca_mulatta. The header line where homo_sapiens and macaca_mulatta appear (i.e. 1 and 5 in the first example) correspond to the column of data representing their respective sequences.
Importantly, these columns are not always the same, so I need to check the header to get the correct indices for each segment, and to check that both species are even in the current segment.
Looking at the two lines of data in example 1, the relevant information for me is
# homo_sapiens_coordinate homo_sapiens_base macaca_mulatta_base
11388669 G G
11388670 C T
For each segment containing info for homo_sapiens and macaca_mulatta, I will record start and end for homo_sapiens from the header and each position where the two DO NOT match into a list. Finally, some positions have "gaps" or lower quality data, i.e.
aaa--A
I will only record from positions where homo_sapiens and macaca_mulatta both have valid bases (must be in the set ACGT) so the last variable I consider is a counter of valid bases per segment.
My final data structure for a given file is a dictionary which looks like this:
{(segment_start=i, segment_end=j, valid_bases=N): list(mismatch positions),
(segment_start=k, segment_end=l, valid_bases=M): list(mismatch positions), ...}
Here is the function I have written to carry this out using a for-loop:
def human_macaque_divergence(chromosome):
"""
A function for finding the positions of human-macaque divergent sites within segments of species alignment tracts
:param chromosome: chromosome (integer:
:return div_dict: a dictionary with tuple(segment_start, segment_end, valid_bases_in_segment) for keys and list(divergent_sites) for values
"""
ch = str(chromosome)
div_dict = {}
with gz.open('{al}Compara.6_primates_EPO.chr{c}_1.emf.gz'.format(al=pd.align, c=ch), 'rb') as f:
# key to the header fields:
# header_flag chromosome segment_start segment_end quality_flag chromosome_info
# SEQ homo_sapiens 1 14163 24841 1 (chr_length=249250621)
# flags, containers, counters and indices:
species = []
starts = []
ends = []
mismatch = []
valid = 0
pos = -1
hom = None
mac = None
species_data = False # a flag signalling that the lines we are viewing are alignment columns
for line in f:
if 'SEQ' in line: # 'SEQ' signifies a segment info field
assert species_data is False
line = line.split()
if line[2] == ch and line[5] == '1': # make sure that the alignment is to the desired chromosome in humans quality_flag is '1'
species += [line[1]] # collect each species in the header
starts += [int(line[3])] # collect starts and ends
ends += [int(line[4])]
if 'DATA' in line and {'homo_sapiens', 'macaca_mulatta'}.issubset(species):
species_data = True
# get the indices to scan in data columns:
hom = species.index('homo_sapiens')
mac = species.index('macaca_mulatta')
pos = starts[hom] # first homo_sapiens positional coordinate
continue
if species_data and '//' not in line:
assert pos > 0
# record the relevant bases:
human = line[hom]
macaque = line[mac]
if {human, macaque}.issubset(bases):
valid += 1
if human != macaque and {human, macaque}.issubset(bases):
mismatch += [pos]
pos += 1
elif species_data and '//' in line: # '//' signifies segment boundary
# store segment results if a boundary has been reached and data has been collected for the last segment:
div_dict[(starts[hom], ends[hom], valid)] = mismatch
# reset flags, containers, counters and indices
species = []
starts = []
ends = []
mismatch = []
valid = 0
pos = -1
hom = None
mac = None
species_data = False
elif not species_data and '//' in line:
# reset flags, containers, counters and indices
species = []
starts = []
ends = []
pos = -1
hom = None
mac = None
return div_dict
This code works fine (perhaps it could use some tweaking), but my real question is whether or not there might be a faster way to pull this data without running the for-loop and examining each line? For example, loading the whole file using f.read() takes less than a second although it creates a pretty complicated string. (In principle, I assume that I could use regular expressions to parse at least some of the data, such as the header info, but I'm not sure if this would necessarily increase performance without some bulk method to process each data column in each segment).
Does anyone have any suggestions as to how I circumvent looping through billions of lines and parse this kind of text file in a more bulk manner?
Please let me know if anything is unclear in comments, happy to edit or respond directly to improve the post!

Yes you could use some regular expressions to make extract the data in one-go; this is probably the best ratio of effort/performances.
If you need more performances, you could use mx.TextTools to build a finite state machine; I'm pretty confident this will be significantly faster, but the effort needed to write the rules and the learning curve might discourage you.
You also could split the data in chunks and parallelize the processing, this could help.

When you have working code and need to improve performance, use a profiler and measure the effect of one optimization at a time. (Even if you don't use the profiler, definitely do the latter.) Your present code looks reasonable, that is, I don't see anything "stupid" in it in terms of performance.
Having said that, it is likely to be worthwhile to use precompiled regular expressions for all string matching. By using re.MULTILINE, you can read in an entire file as a string and pull out parts of lines. For example:
s = open('file.txt').read()
p = re.compile(r'^SEQ\s+(\w+)\s+(\d+)\s+(\d+)\s+(\d+)', re.MULTILINE)
p.findall(s)
produces:
[('homo_sapiens', '1', '11388669', '11532963'),
('pan_troglodytes', '1', '11517444', '11668750'),
('gorilla_gorilla', '1', '11607412', '11751006'),
('pongo_pygmaeus', '1', '218866021', '219020464'),
('macaca_mulatta', '1', '14425463', '14569832'),
('callithrix_jacchus', '7', '45949850', '46115230')]
You will then need to post-process this data to deal with the specific conditions in your code, but the overall result may be faster.

Your code looks good, but there are particular things that could be improved, such as the use of map, etc.
For good guide on performance tips in Python see:
https://wiki.python.org/moin/PythonSpeed/PerformanceTips
I have used the above tips to get code working nearly as fast as C code. Basically, try to avoid for loops (use map), try to use find built-in functions, etc. Make Python work for you as much as possible by using its builtin functions, which are largely written in C.
Once you get acceptable performance you can run in parallel using:
https://docs.python.org/dev/library/multiprocessing.html#module-multiprocessing
Edit:
I also just realized you are opening a compressed gzip file. I suspect a significant amount of time is spent decompressing it. You can try to make this faster by multi-threading it with:
https://code.google.com/p/threadzip/

You can combine re with some fancy zipping in list comprehensions that can replace the for loops and try to squeeze some performance gains. Below I outline a strategy for segmenting the data file read in as an entire string:
import re
from itertools import izip #(if you are using py2x like me, otherwise just use zip for py3x)
s = open('test.txt').read()
Now find all header lines, and the corresponding index ranges in the large string
head_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
head = [ h[0] for h in head_info]
head_inds = [ (h[1],h[2]) for h in head_info]
#head
#['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)',
# '\nSEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)',
#...
#head_inds
#[(107, 169),
# (169, 234),
# (234, 299),
# (299, 366),
# (366, 430),
# (430, 498),
# (1035, 1097),
# (1097, 1162)
# ...
Now, do the same for the data (lines of code with bases)
data_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\n[AGCT-]+.*', s)]
data = [ d[0] for d in data_info]
data_inds = [ (d[1],d[2]) for d in data_info]
Now, whenever there is a new segment, there will be a discontinuity between head_inds[i][1] and head_inds[i+1][0]. Same for data_inds. We can use this knowledge to find the beginning and end of each segment as follows
head_seg_pos = [ idx+1 for idx,(i,j) in enumerate( izip( head_inds[:-1], head_inds[1:])) if j[0]-i[1]]
head_seg_pos = [0] + head_seg_pos + [len(head_seg_pos)] # add beginning and end which we will use next
head_segmented = [ head[s1:s2] for s1,s2 in izip( head_seg_pos[:-1], head_seg_pos[1:]) ]
#[['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)'],
#['\nSEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)',
# ...
and the same for the data
data_seg_pos = [ idx+1 for idx,(i,j) in enumerate( izip( data_inds[:-1], data_inds[1:])) if j[0]-i[1]]
data_seg_pos = [0] + data_seg_pos + [len(data_inds)] # add beginning and end for the next step
data_segmented = [ data[s1:s2] for s1,s2 in izip( data_seg_pos[:-1], data_seg_pos[1:]) ]
Now we can group the segmented data and segmented headers, and only keep groups with data on homo_sapiens and macaca_mulatta
groups = [ [h,d] for h,d in izip( head_segmented, data_segmented) if all( [sp in ''.join(h) for sp in ('homo_sapiens','macaca_mulatta')] ) ]
Now you have a groups array, where each group has
group[0][0] #headers for segment 0
#['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)']
groups[0][1] # data from segment 0
#['\nGGGGGG',
# '\nCCCCTC',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# ...
The next step in the processing I will leave up to you, so I don't steal all the fun. But hopefully this gives you a good idea on using list comprehension to optimize code.
Update
Consider the simple test case to gauge efficiency of the comprehensions combined with re:
def test1():
with open('test.txt','r') as f:
head = []
for line in f:
if line.startswith('SEQ'):
head.append( line)
return head
def test2():
s = open('test.txt').read()
head = re.findall( '\nSEQ.*', s)
return head
%timeit( test1() )
10000 loops, best of 3: 78 µs per loop
%timeit( test2() )
10000 loops, best of 3: 37.1 µs per loop
Even if I gather additional information using re
def test3():
s = open('test.txt').read()
head_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
head = [ h[0] for h in head_info]
head_inds = [ (h[1],h[2]) for h in head_info]
%timeit( test3() )
10000 loops, best of 3: 50.6 µs per loop
I still get speed gains. I believe this may be faster in your case to use list comprehensions. However, the for loop might actually beat the comprehension (I take back what I said before) in end, consider
def test1(): #similar to how you are reading in the data in your for loop above
with open('test.txt','r') as f:
head = []
data = []
species = []
species_data = False
for line in f:
if line.startswith('SEQ'):
head.append( line)
species.append( line.split()[1] )
continue
if 'DATA' in line and {'homo_sapiens', 'macaca_mulatta'}.issubset(species):
species_data = True
continue
if species_data and '//' not in line:
data.append( line )
continue
if species_data and line.startswith( '//' ):
species_data = False
species = []
continue
return head, data
def test3():
s = open('test.txt').read()
head_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
head = [ h[0] for h in head_info]
head_inds = [ (h[1],h[2]) for h in head_info]
data_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\n[AGCT-]+.*', s)]
data = [ h[0] for h in data_info]
data_inds = [ (h[1],h[2]) for h in data_info]
return head,data
In this case, as the iterations become more complex, the traditional for loop wins
In [24]: %timeit(test1())
10000 loops, best of 3: 135 µs per loop
In [25]: %timeit(test3())
1000 loops, best of 3: 256 µs per loop
Though I can still use re.findall twice and beat the for loop:
def test4():
s = open('test.txt').read()
head = re.findall( '\nSEQ.*',s )
data = re.findall( '\n[AGTC-]+.*',s)
return head,data
In [37]: %timeit( test4() )
10000 loops, best of 3: 79.5 µs per loop
I guess as the processing of each iteration becomes increasingly complex, the for loop will win, though there might be a more clever way to continue on with re. I wish there was a standard way to determine when to use either.

File processing with Numpy
The data itself appears to be completely regular and can be processed easily with Numpy. The header is only a tiny part of the file and the processing speed thereof is not very relevant. So the idea is to swich to Numpy for only the raw data and, other than that, keep the existing loops in place.
This approach works best if the number of lines in a data segment can be determined from the header. For the remainder of this answer I assume this is indeed the case. If this is not possible, the starting and ending points of data segments have to be determined with e.g. str.find or regex. This will still run at "compiled C speed" but the downside being that the file has to be looped over twice. In my opinion, if your files are only 50MB it's not a big problem to load a complete file into RAM.
E.g. put something like the following under if species_data and '//' not in line:
# Define `import numpy as np` at the top
# Determine number of rows from header data. This may need some
# tuning, if possible at all
nrows = max(ends[i]-starts[i] for i in range(len(species)))
# Sniff line length, because number of whitespace characters uncertain
fp = f.tell()
ncols = len(f.readline())
f.seek(fp)
# Load the data without loops. The file.read method can do the same,
# but with numpy.fromfile we have it in an array from the start.
data = np.fromfile(f, dtype='S1', count=nrows*ncols)
data = data.reshape(nrows, ncols)
# Process the data without Python loops. Here we leverage Numpy
# to really speed up the processing.
human = data[:,hom]
macaque = data[:,mac]
valid = np.in1d(human, bases) & np.in1d(macaque, bases)
mismatch = (human != macaque)
pos = starts[hom] + np.flatnonzero(valid & mismatch)
# Store
div_dict[(starts[hom], ends[hom], valid.sum())] = pos
# After using np.fromfile above, the file pointer _should_ be exactly
# in front of the segment termination flag
assert('//' in f.readline())
# Reset the header containers and flags
...
So the elif species_data and '//' in line: case has become redundant and the containers and flags can be reset in the same block as the above. Alternatively, you could also remove the assert('//' in f.readline()) and keep the elif species_data and '//' in line: case and reset containers and flags there.
Caveats
For relying on the file pointer to switch between processing the header and the data, there is one caveat: (in CPython) iterating a file object uses a read-ahead buffer causing the file pointer to be further down the file than you'd expect. When you would then use numpy.fromfile with that file pointer, it skips over data at the start of the segment and moreover it reads into the header of the next segment. This can be fixed by exclusively using the file.readline method. We can conveniently use it as an iterator like so:
for line in iter(f.readline, ''):
...
For determining the number of bytes to read with numpy.fromfile there is another caveat: Sometimes there is a single line termination character \n at the end of a line and other times two \r\n. The first is the convention on Linux/OSX and the latter on Windows. There is os.linesep to determine the default, but obviously for file parsing this isn't robust enough. So in the code above the length of a data line is determined by actually reading a line, checking the len, and putting back the file pointer to the start of the line.
When you encounter a data segment ('DATA' in line) and the desired species are not in it, you should be able to calculate an offset and f.seek(f.tell() + offset) to the header of the next segment. Much better than looping over data you're not even interested in!

Discover different lines across similar files

I have a text file with many tens of thousands short sentences like this:
go to venice
come back from grece
new york here i come
from belgium to russia and back to spain
I run a tagging algorithm which produces a tagged output of this sentence file:
go to <place>venice</place>
come back from <place>grece</place>
<place>new york</place> here i come
from <place>belgium</place> to <place>russia</place> and back to <place>spain</place>
The algorithm runs over the input multiple times and produces each time slightly different tagging. My goal is to identify those lines where those differences occur. In other words, print all utterances for which the tagging differs across N results files.
For example N=10, I get 10 tagged files. Suppose line 1 is tagged all the time the same for all 10 tagged files - do not print it. Suppose line 2 is tagged once this way and 9 times other way - print it. And so on.
For N=2 is easy, I just run diff. But what to do if I have N=10 results?

If you have the tagged files - just create a counter for each line of how many times you've seen it:
# use defaultdict for convenience
from collections import defaultdict
# start counting at 0
counter_dict = defaultdict(lambda: 0)
tagged_file_names = ['tagged1.txt', 'tagged2.txt', ...]
# add all lines of each file to dict
for file_name in tagged_file_names:
with open(file_name) as f:
# use enumerate to maintain order
# produces (LINE_NUMBER, LINE CONTENT) tuples (hashable)
for line_with_number in enumerate(f.readlines()):
counter_dict[line_with_number] += 1
# print all values that do not repeat in all files (in same location)
for key, value in counter_dict.iteritems():
if value < len(tagged_file_names):
print "line number %d: [%s] only repeated %d times" % (
key[0], key[1].strip(), value
)
Walkthrough:
First of all, we create a data structure to enable us counting our entries, which are numbered lines. This data structure is a collections.defaultdict which a default value of 0 - which is the count of newly added lines (increased to 1 with each add).
Then, we create the actual entry using a tuple which is hashable, so it can be used as a dictionary key, and by default deeply-comparable to other tuples. this means (1, "lolz") is equal to (1, "lolz") but different than (1, "not lolz") or (2, lolz) - so it fits our use of deep-comparing lines to account for content as well as position.
Now all that's left to do is add all entries using a straightforward for loop and see what keys (which correspond to numbered lines) appear in all files (that is - their value is equal to the number of tagged files provided).
Example:
reut#tHP-EliteBook-8470p:~/python/counter$ cat tagged1.txt
123
abc
def
reut#tHP-EliteBook-8470p:~/python/counter$ cat tagged2.txt
123
def
def
reut#tHP-EliteBook-8470p:~/python/counter$ ./difference_counter.py
line number 1: [abc] only repeated 1 times
line number 1: [def] only repeated 1 times

if you compare all of them to the first text, then you can get a list of all texts that are different. this might not be the quickest way but it would work.
import difflib
n1 = '1 2 3 4 5 6'
n2 = '1 2 3 4 5 6'
n3 = '1 2 4 5 6 7'
l = [n1, n2, n3]
m = [x for x in l if x != l[0]]
diff = difflib.unified_diff(l[0], l.index(m))
print ''.join(diff)

Optimizing searches in very large csv files

I have a csv file with a single column, but 6.2 million rows, all containing strings between 6 and 20ish letters. Some strings will be found in duplicate (or more) entries, and I want to write these to a new csv file - a guess is that there should be around 1 million non-unique strings. That's it, really. Continuously searching through a dictionary of 6 million entries does take its time, however, and I'd appreciate any tips on how to do it. Any script I've written so far takes at least a week (!) to run, according to some timings I did.
First try:
in_file_1 = open('UniProt Trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+')
out_file_2 = open('UniProt Unique Trypsin Peptides.csv','w+')
writer_1 = csv.writer(out_file_1)
writer_2 = csv.writer(out_file_2)
# Create trypsinome dictionary construct
ref_dict = {}
for row in range(len(in_list_1)):
ref_dict[row] = in_list_1[row]
# Find unique/non-unique peptides from trypsinome
Peptide_list = []
Uniques = []
for n in range(len(in_list_1)):
Peptide = ref_dict.pop(n)
if Peptide in ref_dict.values(): # Non-unique peptides
Peptide_list.append(Peptide)
else:
Uniques.append(Peptide) # Unique peptides
for m in range(len(Peptide_list)):
Write_list = (str(Peptide_list[m]).replace("'","").replace("[",'').replace("]",''),'')
writer_1.writerow(Write_list)
Second try:
in_file_1 = open('UniProt Trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+')
writer_1 = csv.writer(out_file_1)
ref_dict = {}
for row in range(len(in_list_1)):
Peptide = in_list_1[row]
if Peptide in ref_dict.values():
write = (in_list_1[row],'')
writer_1.writerow(write)
else:
ref_dict[row] = in_list_1[row]
EDIT: here's a few lines from the csv file:
SELVQK
AKLAEQAER
AKLAEQAERR
LAEQAER
LAEQAERYDDMAAAMK
LAEQAERYDDMAAAMKK
MTMDKSELVQK
YDDMAAAMKAVTEQGHELSNEER
YDDMAAAMKAVTEQGHELSNEERR

Do it with Numpy. Roughly:
import numpy as np
column = 42
mat = np.loadtxt("thefile", dtype=[TODO])
uniq = set(np.unique(mat[:,column]))
for row in mat:
if row[column] not in uniq:
print row
You could even vectorize the output stage using numpy.savetxt and the char-array operators, but it probably won't make very much difference.

First hint : Python has support for lazy evaluation, better to use it when dealing with huge datasets. So :
iterate over your csv.reader instead of building a huge in-memory list,
don't build huge in-memory lists with ranges - use enumerate(seq) instead if you need both the item and index, and just iterate over your sequence's items if you don't need the index.
Second hint : the main point of using a dict (hashtable) is to lookup on keys, not values... So don't build a huge dict that's used as a list.
Third hint : if you just want a way to store "already seen" values, use a Set.

I'm not so good in Python, so I don't know how the 'in' works, but your algorithm seems to run in n².
Try to sort your list after reading it, with an algo in n log(n), like quicksort, it should work better.
Once the list is ordered, you just have to check if two consecutive elements of the list are the same.
So you get the reading in n, the sorting in n log(n) (at best), and the comparison in n.

Although I think that the numpy solution is the best, I'm curious whether we can speed up the given example. My suggestions are:
skip csv.reader costs and just read the line
rb to skip the extra scan needed to fix newlines
use bigger file buffer sizes (read 1Meg, write 64K is probably good)
use the dict keys as an index - key lookup is much faster than value lookup
I'm not a numpy guy, so I'd do something like
in_file_1 = open('UniProt Trypsinome (full).csv','rb', 1048576)
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+', 65536)
ref_dict = {}
for line in in_file_1:
peptide = line.rstrip()
if peptide in ref_dict:
out_file_1.write(peptide + '\n')
else:
ref_dict[peptide] = None

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to improve Python iteration performance over large files - python

You can do the following, take a dictionary dic and do the following ( following is a pseudocode, also I assume sizes don't repeat ) for index,size in the first file: dic[size] = index for index,offset,size in second file: if size in dic.keys(): print dic[size],offset

Related

How to change unit of list that was read from a text file?

Python input/output more efficiently

Beyond for-looping: high performance parsing of a large, well formatted data file

Discover different lines across similar files

Optimizing searches in very large csv files

Categories

Resources