Iterate over list using mmap - Python

Is it possible to iterate over a list using mmap file?
The point is that the list is too big (over 3,000,000 items). I need fast access to this list when I start the program, so I can't load it into memory after startup because that takes several seconds.
import mmap

with open('list', 'rb') as f:
    mmapList = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # As far as I'm concerned, now I have the list mapped in virtual memory.
Now, I want to iterate over this list.
for a in mmapList does not work.
EDIT: The only way I know is to save the list items as rows in a txt file and then use readline, but I'm curious whether there is a better and faster way.

You don't need to use mmap to iterate through the pickled list. All you need to do is, instead of pickling the whole list, pickle and dump each element, then read them back one by one from the file (a generator works nicely for that).
Code:
import pickle

def unpickle_iter(f):
    # Yield one unpickled object at a time until the file is exhausted.
    while True:
        try:
            obj = pickle.load(f)
        except EOFError:
            break
        yield obj

def save_list(items, path):
    with open(path, 'wb') as f:  # pickle needs binary mode
        for item in items:
            pickle.dump(item, f)

def load_list(path):
    with open(path, 'rb') as f:
        # here is your nice "for a in mmapList" equivalent:
        for obj in unpickle_iter(f):
            print('Loaded object:', obj)

save_list([1, 2, 'hello world!', dict()], 'test-pickle.dat')
load_list('test-pickle.dat')
Output:
Loaded object: 1
Loaded object: 2
Loaded object: hello world!
Loaded object: {}
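Regarding the EDIT: if you do keep the one-item-per-line text file, you can still walk it through mmap without loading the whole thing. A minimal sketch, assuming the items are stored one per line ('list.txt' is a hypothetical file name; mmap.readline() returns b'' at end of file, so iter() with a sentinel works):
import mmap

with open('list.txt', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for line in iter(mm.readline, b''):  # one line at a time, backed by the mapping
        item = line.rstrip(b'\n')
        # process item here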


Write items to a dictionary from a file in Python

I am trying to read lists from a file and assign them to separate values in a dictionary. The text file would look something like this:
[12, 13, 14]
[87, 45, 32]
...
and then the dictionary would look something like this:
{"score_set0": [12, 13, 14], "score_set1": [87, 45, 32]...}
This is the code I have got so far, but it just returns an empty dictionary:
def readScoresFile(fileAddr):
    dic = {}
    i = 0
    with open(fileAddr, "r") as f:
        x = len(f.readlines())
        for line in f:
            dic["score_set{}".format(x[i])] = line
            i += 1
    return dic
I am only programming at GCSE level (UK OCR syllabus, if that helps) in year 10. Thanks for any help anyone can give.
Also, I am trying to do this without the pickle module.
x = len(f.readlines()) consumed your whole file, so your subsequent loop over f is iterating an exhausted file handle; it sees no remaining lines and exits immediately.
There's zero need to pre-check the length here (and the only use you make of x is trying to index it, which makes no sense; you avoided a TypeError solely because the loop never ran), so just omit that and use enumerate to get the numbers as you go:
def readScoresFile(fileAddr):
    dic = {}
    with open(fileAddr, "r") as f:
        for i, line in enumerate(f):  # Let enumerate manage the numbering for you
            dic["score_set{}".format(i)] = line  # If you're on 3.6+, dic[f'score_set{i}'] = line is nicer
    return dic
Note that this does not actually convert the input lines to lists of int (neither did your original code). If you want to do that, you can change:
dic[f'score_set{i}'] = line
to:
dic[f'score_set{i}'] = ast.literal_eval(line) # Add import ast to top of file
to interpret the line as a Python literal, or:
dic[f'score_set{i}'] = json.loads(line) # Add import json to top of file
to interpret each line as JSON (faster, but supports fewer Python types, and some legal Python literals are not legal JSON).
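For example, both approaches parse the question's sample lines the same way (JSON requires double-quoted strings, while literal_eval also accepts single quotes, tuples and sets):
import ast, json

line = '[12, 13, 14]'
print(ast.literal_eval(line))  # [12, 13, 14]
print(json.loads(line))        # [12, 13, 14]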
As a rule, you basically never want to use .readlines(); simply iterating over the file handle gets you the lines lazily and avoids a memory requirement proportional to the size of the file. (Frankly, I'd have preferred it if they'd gotten rid of it in Py3, since list(f) gets the same result if you really need it, and it wouldn't leave a visible method that encourages you to do "The Wrong Thing" so often.)
By operating line-by-line, you eventually store all the data, but that's better than doubling the overhead by storing both the parsed data in the dict and all the string data it came from in the list.
If you're trying to turn the lines into actual Python lists, I suggest using the json module. (Another option would be ast.literal_eval, since the syntax happens to be the same in this case.)
import json

def read_scores_file(file_path):
    with open(file_path) as f:
        return {
            f"score_set{i}": json.loads(line)
            for i, line in enumerate(f)
        }
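Given the sample file from the question (assuming it is saved as scores.txt), a quick check:
print(read_scores_file('scores.txt'))
# {'score_set0': [12, 13, 14], 'score_set1': [87, 45, 32]}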

Iterating over CSV reader object in Python

I have two CSV files, one of which is likely to contain a few more records than the other. I am writing a function to iterate over each and determine which records are in dump but not libr.
My code is as follows:
import csv

def update_lib(x, y):
    dump = open(x, newline='')
    libr = open(y, newline='')
    dump_reader = csv.reader(dump)
    for dump_row in dump_reader:
        libr_reader = csv.reader(libr)
        for libr_row in libr_reader:
            if dump_row[0] == libr_row[0]:
                break
I am expecting this to take the first row in dump (dump_row) and iterate over each row in libr (libr_row) to see if the first elements match. If they do, I want to move to the next row in dump; if not, I will eventually do something else.
My issue is that libr_reader appears to "remember" where it is, and I can't get it to go back to the first row in libr, even after the break has been reached, when I would expect libr_reader to be re-initialised. I have even tried del libr_row and del libr_reader, but this doesn't appear to make a difference. I suspect I am misunderstanding iterators; any help gratefully received.
As it's pasted in your question, you'll be creating a libr_reader object every time you iterate over a row in dump_reader.
dump_reader = csv.reader(dump)
for dump_row in dump_reader:
    libr_reader = csv.reader(libr)
dump_reader here is created once. Assuming there are 10 rows from dump_reader, you will be creating 10 libr_reader instances, all from the same file handle.
Per our discussion in the comments, you're aware of that, but what you're unaware of is that the reader object is working on the same file handle and thus, is still at the same cursor.
Consider this example:
>>> import io
>>> my_file = io.StringIO("""Line 1
... Another Line
... Finally, a third line.""")
This is creating a simulated file object. Now I'll create a "LineReader" class.
>>> class LineReader:
...     def __init__(self, file):
...         self.file = file
...     def show_me_a_line(self):
...         print(self.file.readline())
...
If I use three line readers on the same file, the file still remembers its place:
>>> line_reader = LineReader(my_file)
>>> line_reader.show_me_a_line()
Line 1
>>> second_line_reader = LineReader(my_file)
>>> second_line_reader.show_me_a_line()
Another Line
>>> third_line_reader = LineReader(my_file)
>>> third_line_reader.show_me_a_line()
Finally, a third line.
To the my_file object, there's no material difference between what I just did, and doing this directly. First, I'll "reset" the file to the beginning by calling seek(0):
>>> my_file.seek(0)
0
>>> my_file.readline()
'Line 1\n'
>>> my_file.readline()
'Another Line\n'
>>> my_file.readline()
'Finally, a third line.'
There you have it.
So, TL;DR: files have cursors and remember where they are. Think of the file handle as a thing that remembers where the file is, yes, but also remembers where in the file your program is.
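To make the original function work, you could call libr.seek(0) before building each new reader, but it is cheaper to read the reference file's keys once into a set. A minimal sketch of the set-based version, assuming the first column is the key to match on:
import csv

def update_lib(dump_path, libr_path):
    with open(libr_path, newline='') as libr:
        libr_keys = {row[0] for row in csv.reader(libr) if row}  # read libr exactly once
    with open(dump_path, newline='') as dump:
        for dump_row in csv.reader(dump):
            if dump_row and dump_row[0] not in libr_keys:
                print('In dump but not libr:', dump_row)  # placeholder for the eventual "something else"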

Directly calling SeqIO.parse() in a for loop works, but using it separately beforehand doesn't? Why?

In Python, this code, where I directly call the function SeqIO.parse(), runs fine:
from Bio import SeqIO
a = SeqIO.parse("a.fasta", "fasta")
records = list(a)
for asq in SeqIO.parse("a.fasta", "fasta"):
    print("Q")
But this, where I first store the output of SeqIO.parse() in a variable(?) called a and then try to use it in my loop, doesn't run:
from Bio import SeqIO
a = SeqIO.parse("a.fasta", "fasta")
records = list(a)
for asq in a:
    print("Q")
Is this because the output of the function SeqIO.parse("a.fasta", "fasta") is being stored in a differently from when I call it directly?
What exactly is the identity of a here? Is it a variable? Is it an object? What does the function actually return?
SeqIO.parse() returns a normal Python generator. This part of the Biopython module is written in pure Python:
>>> from Bio import SeqIO
>>> a = SeqIO.parse("a.fasta", "fasta")
>>> type(a)
<class 'generator'>
Once a generator is iterated over, it is exhausted, as you discovered (in your second example, records = list(a) already drains a before the loop starts). You can't rewind a generator, but you can store the contents in a list or dict if you don't mind putting it all in memory (useful if you need random access). You can use SeqIO.to_dict(a) to store it in a dictionary with the record ids as the keys and the records as the values. Simply re-building the generator by calling SeqIO.parse() again will avoid dumping the file contents into memory, of course.
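For instance, a minimal sketch of both options, reusing the a.fasta example from the question:
from Bio import SeqIO

records = list(SeqIO.parse("a.fasta", "fasta"))  # drain the generator once, into a list
for rec in records:
    print("Q")
for rec in records:  # a list can be iterated again; the bare generator could not
    print(rec.id)

by_id = SeqIO.to_dict(SeqIO.parse("a.fasta", "fasta"))  # random access by record id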
I have a similar issue where the parsed sequence file doesn't work inside a for-loop. Code below:
genomes_l = pd.read_csv('test_data.tsv', sep='\t', header=None,
                        names=['anonymous_gsa_id', 'genome_id'])
# sample_f = SeqIO.parse('SAMPLE.fasta', 'fasta')
for i, r in genomes_l.iterrows():
    genome_name = r['anonymous_gsa_id']
    genome_ids = r['genome_id'].split(',')
    genome_contigs = [rec for rec in SeqIO.parse('SAMPLE.fasta', 'fasta') if rec.id in genome_ids]
    with open(f'out_dir/{genome_name}_contigs.fasta', 'w') as handle:
        SeqIO.write(genome_contigs, handle, 'fasta')
Originally, I read the file in as sample_f, but inside the loop it wouldn't work. I would appreciate any help to avoid having to read the file over and over again. Specifically, the line below:
genome_contigs = [rec for rec in SeqIO.parse('SAMPLE.fasta', 'fasta') if rec.id in genome_ids]
Thank you!
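A minimal sketch of one fix, assuming all of SAMPLE.fasta fits in memory: index the file once with SeqIO.to_dict before the loop, then look contigs up by id instead of re-parsing per row:
from Bio import SeqIO

sample_index = SeqIO.to_dict(SeqIO.parse('SAMPLE.fasta', 'fasta'))  # parse the file once

for i, r in genomes_l.iterrows():
    genome_name = r['anonymous_gsa_id']
    genome_ids = r['genome_id'].split(',')
    genome_contigs = [sample_index[gid] for gid in genome_ids if gid in sample_index]
    with open(f'out_dir/{genome_name}_contigs.fasta', 'w') as handle:
        SeqIO.write(genome_contigs, handle, 'fasta')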

Store many variables in a file

I'm trying to store many variables in a file. I've tried JSON, pickle and shelve, but they all seem to take only one variable.
import shelve

myShelve = shelve.open('my.shelve')
myShelve.update(aasd, basd, casd, dasd, easd,
                fasd, gasd, hasd, iasd, jasd)
myShelve.close()
And pickle
import pickle

with open("vars.txt", "wb") as File:
    pickle.dumps(aasd, basd, casd, dasd, easd,
                 fasd, gasd, hasd, iasd, jasd,
                 File)
The errors I'm getting are along the lines of
TypeError: update() takes at most 2 positional arguments (11 given)
and
TypeError: pickle.dumps() takes at most 2 positional argument (11 given)
I'm not sure if there's any other way of storing variables except using a database, but that's a bit beyond what I'm currently capable of, I'd say.
You can only pickle one variable at a time, but it can be a dict or other Python object. You could store your many variables in one object and pickle that object.
import pickle

class Box:
    pass

vars = Box()
vars.x = 1
vars.y = 2
vars.z = 3

with open("save_vars.pickle", "wb") as f:
    f.write(pickle.dumps(vars))

with open("save_vars.pickle", "rb") as f:
    v = pickle.load(f)

assert vars.__dict__ == v.__dict__
Using pickle, you dump one object at a time. Each time you dump to the file, you add another "record".
import pickle

with open("vars.txt", "wb") as File:
    for item in (aasd, basd, casd, dasd, easd,
                 fasd, gasd, hasd, iasd, jasd):
        pickle.dump(item, File)
Now, on when you want to get your data back, you use pickle.load to read the next "record" from the file:
import pickle

with open('vars.txt', 'rb') as fin:  # pickle files must be read in binary mode
    aasd = pickle.load(fin)
    basd = pickle.load(fin)
    ...
Alternatively, depending on the type of data (assuming it's stuff json is able to serialize), you can store it in a JSON list:
import json

# dump to a string, but you could use json.dump to dump it to a file.
json.dumps([aasd, basd, casd, dasd, easd,
            fasd, gasd, hasd, iasd, jasd])
EDIT: I just thought of a different way to store your variables, but it is a little weird, and I wonder what the gurus think about this.
You can save a file that has the python code of your variable definitions in it, for example vars.py which consists of simple statements defining your values:
x = 30
y = [1,2,3]
Then to load that into your program, just do from vars import * and you will have x and y defined, as if you had typed them in.
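For example, a minimal sketch using the vars.py above:
# main.py
from vars import *  # executes vars.py and copies its top-level names into this namespace
print(x, y)  # 30 [1, 2, 3]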
Original normal answer below...
There is a way using JSON to get your variables back without redefining their names, but you do have to create a dictionary of variables first.
import json

vars = {}  # the dictionary we will save
LoL = [list(range(5)), list("ABCDE"), list(range(5))]
vars['LOList'] = LoL
vars['x'] = 24
vars['y'] = "abc"

with open('Jfile.txt', 'w') as myfile:
    json.dump(vars, myfile, indent=2)
Now to load them back:
with open('Jfile.txt', 'r') as infile:
    D = json.load(infile)

# The "trick" to get the variables in as x, y, etc.:
globals().update(D)
Now x and y are defined from their dictionary entries:
print(x, y)
24 abc
There is also an alternative using variable-by-variable definitions. In this way, you don't have to create the dictionary up front, but you do have to re-name the variables in proper order when you load them back in.
z = 26
w = "def"
with open('Jfile2.txt', 'w') as myfile:
    json.dump([z, w], myfile, indent=2)

with open('Jfile2.txt', 'r') as infile:
    zz, ww = json.load(infile)
And the output:
print(zz, ww)
26 def

Passing values and calling functions from other functions

I have this class that consists of 3 functions. Each function is in charge of one part of the whole process.
.load() loads up two files, re-formats their content and writes them to two new files.
.compare() takes two files and prints out their differences in a specific format.
.final() takes the result of .compare() and creates a file for every set of values.
Please ignore the Frankenstein nature of the logic, as it is not my main concern at the moment. I know it can be written a thousand times better, and that's fine by me for now, as I am still new to Python and programming in general. I do have some theoretical experience but very limited technical practice, and that is something I am working on.
Here is the code:
from collections import defaultdict
from operator import itemgetter
from itertools import groupby
from collections import deque
import os

class avs_auto:

    def load(self, fileIn1, fileIn2, fileOut1, fileOut2):
        with open(fileIn1+'.txt') as fin1, open(fileIn2+'.txt') as fin2:
            frame_rects = defaultdict(list)
            for row in (map(str, line.split()) for line in fin1):
                id, frame, rect = row[0], row[2], [row[3], row[4], row[5], row[6]]
                frame_rects[frame].append(id)
                frame_rects[frame].append(rect)
            for row in (map(str, line.split()) for line in fin2):
                id, frame, rect = row[0], row[2], [row[3], row[4], row[5], row[6]]
                frame_rects[frame].append(id)
                frame_rects[frame].append(rect)
        with open(fileOut1+'.txt', 'w') as fout1, open(fileOut2+'.txt', 'w') as fout2:
            for frame, rects in sorted(frame_rects.iteritems()):
                fout1.write('{{{}:{}}}\n'.format(frame, rects))
                fout2.write('{{{}:{}}}\n'.format(frame, rects))

    def compare(self, f1, f2):
        with open(f1+'.txt', 'r') as fin1:
            with open(f2+'.txt', 'r') as fin2:
                lines1 = fin1.readlines()
                lines2 = fin2.readlines()
                diff_lines = [l.strip() for l in lines1 if l not in lines2]
        diffs = defaultdict(list)
        with open(f1+'x'+f2+'Result.txt', 'w') as fout:
            for line in diff_lines:
                d = eval(line)
                for k in d:
                    list_ids = d[k]
                    for i in range(0, len(d[k]), 2):
                        diffs[d[k][i]].append(k)
            for id_ in diffs:
                diffs[id_].sort()
                for k, g in groupby(enumerate(diffs[id_]), lambda (i, x): i - x):
                    group = map(itemgetter(1), g)
                    fout.write('{0} {1} {2}\n'.format(id_, group[0], group[-1]))

    def final(self):
        with open('hw1load3xhw1load2Result.txt', 'r') as fin:
            lines = (line.split() for line in fin)
            for k, g in groupby(lines, itemgetter(0)):
                fst = next(g)
                lst = next(iter(deque(g, 1)), fst)
                with open('final/{}.avs'.format(k), 'w') as fout:
                    fout.write('video0=ImageSource("MovieName\original\%06d.jpeg", {}, {}, 15)\n'.format(fst[1], lst[2]))
Now to my question: how do I make it so that each function passes its output files as values to the next function and calls it?
So for an example:
Running .load() should output two files and call the .compare() function, passing it those two files.
Then, when .compare() is done, it should pass .final() the output file and call it.
So .final() will open whatever file is passed to it from .compare() and not "test123.txt" as it is defined above.
I hope this all makes sense. Let me know if you need clarification. Any criticism is welcome concerning the code itself. Thanks in advance.
There are a couple of ways to do this, but I would write a master function that calls the other three in sequence. Something like:
def load_and_compare(self, input_file1, input_file2, output_file1, output_file2, result_file):
    self.load(input_file1, input_file2, output_file1, output_file2)
    self.compare(output_file1, output_file2)
    self.final(result_file)
Looking over your code, I think you have a problem in load. You only declare a single dictionary, then load the contents of both files into it and write those same contents out to two files. Because each file has the same content, compare won't do anything meaningful.
Also, do you really want to write out the file contents and then re-read it into memory? I would keep the frame definitions in memory for use in compare after loading rather than reading them back in.
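A minimal sketch of that idea, assuming the input format from the question: build one dict per input file and return both, so compare can diff them in memory instead of round-tripping through files:
from collections import defaultdict

def load(self, fileIn1, fileIn2):
    def read_frames(path):
        frame_rects = defaultdict(list)
        with open(path + '.txt') as fin:
            for line in fin:
                row = line.split()
                frame_rects[row[2]].append(row[0])    # id, keyed by frame
                frame_rects[row[2]].append(row[3:7])  # rect
        return frame_rects
    return read_frames(fileIn1), read_frames(fileIn2)  # one dict per file, kept in memory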
I don't really see a reason for this to be a class at all rather than just a trio of functions, but maybe if you have to read multiple files with mildly varying formats you could get some benefit of using class attributes to define the format while inheriting the general logic.
Do you mean call it with the names of the two files? Well, you defined a class, so you can just do:
def load(self, fileIn1, fileIn2, fileOut1, fileOut2):
    ...  # do stuff here
    # when done
    self.compare(fileOut1, fileOut2)
And so on.
I might be totally off here, but why don't you do it exactly as you're saying?
Just call self.compare() from your load() method.
You can also add return statements to load() and return a tuple with the files.
Then add a 4th method to your class, which then collects the returned files and pipes them to the compare() method.
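A minimal sketch of that fourth method (run() is a hypothetical name, and it assumes load() and compare() are changed to return their output file names):
def run(self, fileIn1, fileIn2, fileOut1, fileOut2):
    out1, out2 = self.load(fileIn1, fileIn2, fileOut1, fileOut2)  # assumed to return the two names
    result = self.compare(out1, out2)                             # assumed to return the result path
    self.final(result)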
Best Regards!
One of the more powerful aspects of Python is that you can return something called a tuple. To answer this in a more generic Python sense consider this code:
>>> def load(file1, file2):
...     return file1+'.txt', file2+'.txt'
...
>>> def convert(file1, file2):
...     return 'converted_'+file1, 'converted_'+file2
...
>>> convert(*load("Java", "C#"))
('converted_Java.txt', 'converted_C#.txt')
Each function takes two named arguments, but the returned tuple of the first can be "unpacked" into the input arguments of the second by adding a * in front of it.
