I'm trying to store many variables in a file. I've tried JSON, pickle and shelve, but they all seem to only take one variable at a time:
import shelve

myShelve = shelve.open('my.shelve')
myShelve.update(aasd,
                basd,
                casd,
                dasd,
                easd,
                fasd,
                gasd,
                hasd,
                iasd,
                jasd)
myShelve.close()
And pickle
import pickle

with open("vars.txt", "wb") as File:
    pickle.dumps(aasd,
                 basd,
                 casd,
                 dasd,
                 easd,
                 fasd,
                 gasd,
                 hasd,
                 iasd,
                 jasd,
                 File)
The errors I'm getting are along the lines of
TypeError: update() takes at most 2 positional arguments (11 given)
and
TypeError: pickle.dumps() takes at most 2 positional arguments (11 given)
I'm not sure if there's any other way of storing variables except using a database, but that's a bit beyond what I'm currently capable of, I'd say.
You can only pickle one variable at a time, but it can be a dict or other Python object. You could store your many variables in one object and pickle that object.
import pickle

class Box:
    pass

vars = Box()
vars.x = 1
vars.y = 2
vars.z = 3

with open("save_vars.pickle", "wb") as f:
    f.write(pickle.dumps(vars))

with open("save_vars.pickle", "rb") as f:
    v = pickle.load(f)

assert vars.__dict__ == v.__dict__
Using pickle, you dump one object at a time. Each time you dump to the file, you add another "record".
import pickle

with open("vars.txt", "wb") as File:
    for item in (aasd, basd, casd, dasd, easd,
                 fasd, gasd, hasd, iasd, jasd):
        pickle.dump(item, File)
Now, when you want to get your data back, you use pickle.load to read the next "record" from the file:
import pickle

with open('vars.txt', 'rb') as fin:
    aasd = pickle.load(fin)
    basd = pickle.load(fin)
    ...
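If the number of records isn't fixed, you can keep calling pickle.load until the file runs out. A small sketch (not part of the original code; it reuses the vars.txt file written above):

import pickle

# Read back an unknown number of pickled records, one per pickle.load call.
loaded = []
with open('vars.txt', 'rb') as fin:
    while True:
        try:
            loaded.append(pickle.load(fin))
        except EOFError:  # raised once there are no more records in the file
            break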
Alternatively, assuming your data is something the json module can serialize, you can store it in a JSON list:
import json

# dump to a string here, but you could use json.dump to write it to a file
json.dumps([aasd, basd, casd, dasd, easd,
            fasd, gasd, hasd, iasd, jasd])
EDIT: I just thought of a different way to store your variables, but it is a little weird, and I wonder what the gurus think about this.
You can save a file that has the python code of your variable definitions in it, for example vars.py which consists of simple statements defining your values:
x = 30
y = [1,2,3]
Then to load that into your program, just do from vars import * and you will have x and y defined, as if you had typed them in.
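For example, a minimal sketch of the loading side, assuming vars.py sits next to the script that imports it:

# main.py - illustrative name for the script importing the definitions
from vars import *  # pulls x and y into this module's namespace

print(x)  # 30
print(y)  # [1, 2, 3]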
Original normal answer below...
There is a way using JSON to get your variables back without redefining their names, but you do have to create a dictionary of variables first.
import json

vars = {}  # the dictionary we will save
LoL = [list(range(5)), list("ABCDE"), list(range(5))]
vars['LOList'] = LoL
vars['x'] = 24
vars['y'] = "abc"

with open('Jfile.txt', 'w') as myfile:
    json.dump(vars, myfile, indent=2)
Now to load them back:
with open('Jfile.txt', 'r') as infile:
    D = json.load(infile)

# The "trick" to get the variables in as x, y, etc.:
globals().update(D)
Now x and y are defined from their dictionary entries:
print(x, y)
24 abc
There is also an alternative using variable-by-variable definitions. In this way, you don't have to create the dictionary up front, but you do have to name the variables in the proper order when you load them back in.
z = 26
w = "def"

with open('Jfile2.txt', 'w') as myfile:
    json.dump([z, w], myfile, indent=2)

with open('Jfile2.txt', 'r') as infile:
    zz, ww = json.load(infile)
And the output:
print(zz, ww)
26 def
Related
In Python, this code, where I directly call the function SeqIO.parse(), runs fine:
from Bio import SeqIO

a = SeqIO.parse("a.fasta", "fasta")
records = list(a)

for asq in SeqIO.parse("a.fasta", "fasta"):
    print("Q")
But this, where I first store the output of SeqIO.parse() in a variable(?) called a and then try to use it in my loop, doesn't run:
from Bio import SeqIO

a = SeqIO.parse("a.fasta", "fasta")
records = list(a)

for asq in a:
    print("Q")
Is this because the output of SeqIO.parse("a.fasta", "fasta") is being stored in a differently from when I call it directly?
What exactly is the identity of a here? Is it a variable? Is it an object? What does the function actually return?
SeqIO.parse() returns a normal Python generator. This part of the Biopython module is written in pure Python:
>>> from Bio import SeqIO
>>> a = SeqIO.parse("a.fasta", "fasta")
>>> type(a)
<class 'generator'>
Once a generator is iterated over, it is exhausted, as you discovered. You can't rewind a generator, but you can store its contents in a list or dict if you don't mind putting it all in memory (useful if you need random access). You can use SeqIO.to_dict(a) to store it in a dictionary with the record ids as the keys and the records as the values. Simply rebuilding the generator by calling SeqIO.parse() again will, of course, avoid dumping the file contents into memory.
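For illustration, a short sketch of both options, assuming Biopython is installed and a.fasta exists:

from Bio import SeqIO

# Materialise the generator once; the list can be iterated over repeatedly.
records = list(SeqIO.parse("a.fasta", "fasta"))
for rec in records:
    print(rec.id)

# Or key the records by id for random access.
by_id = SeqIO.to_dict(SeqIO.parse("a.fasta", "fasta"))
print(by_id[records[0].id].seq)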
I have a similar issue: the parsed sequence file doesn't work inside a for loop. Code below:
genomes_l = pd.read_csv('test_data.tsv', sep='\t', header=None,
                        names=['anonymous_gsa_id', 'genome_id'])
# sample_f = SeqIO.parse('SAMPLE.fasta', 'fasta')

for i, r in genomes_l.iterrows():
    genome_name = r['anonymous_gsa_id']
    genome_ids = r['genome_id'].split(',')
    genome_contigs = [rec for rec in SeqIO.parse('SAMPLE.fasta', 'fasta') if rec.id in genome_ids]
    with open(f'out_dir/{genome_name}_contigs.fasta', 'w') as handle:
        SeqIO.write(genome_contigs, handle, 'fasta')
Originally, I read the file in as sample_f, but inside the loop that wouldn't work. I would appreciate any help to avoid having to read the file over and over again, specifically on the line below:
genome_contigs = [rec for rec in SeqIO.parse('SAMPLE.fasta', 'fasta') if rec.id in genome_ids]
Thank you!
I am writing a function that exports variables as a dictionary to an external file.
The problem comes when calling that function from another script. I think it has something to do with the globals() parameter.
import sys
import os

mydict = {}  # "initialising" an empty dictionary to be used locally in the function below

def writeToValues(name):
    fileName = os.path.splitext(os.path.basename(sys.argv[0]))[0]
    valuePrint = open("values.py", "a")

    def namestr(obj, namespace):
        return [name for name in namespace if namespace[name] is obj]

    b = namestr(name, globals())
    c = "".join(str(x) for x in b)
    mydict[(c)] = name
    valuePrint.write(fileName)
    valuePrint.write("=")
    valuePrint.write(str(mydict))
    valuePrint.write("\n")
    valuePrint.close()
    return mydict

a = 2
b = 3

writeToValues(a)
writeToValues(b)
I get the following result:
Main Junkfile={'a': 2, 'b': 3}
Note that Main Junkfile is the name of the script I ran; the first thing the function does is get the name of the file and use it to name the dictionary.
Now, help me out here, as I cannot generate the same result if I import the function from another script.
Another problem is that running the script twice generates the values in steps.
Main Junkfile={'a': 2}
Main Junkfile={'b': 3, 'a': 2}
I cannot change the file open mode from append to write since I want to store values from other scripts, too.
This is not perfect, but it might help as an example:
import sys
import os

mydict = {}

def namestr(obj, namespace):
    return [name for name in namespace if namespace[name] is obj]

def writeto(name):
    fout = 'values.py'
    filename = os.path.splitext(os.path.basename(sys.argv[0]))[0]
    with open(fout, 'a') as f:
        b = namestr(name, globals())
        c = "".join(str(x) for x in b)
        mydict[(c)] = name
        data = filename + '=' + str(mydict) + '\n'
        f.write(data)
    return mydict

a = 2
b = 3

if __name__ == '__main__':
    writeto(a)
    writeto(b)
First of all, to get the name of the currently executing script, or rather the module that called your function, you'll have to pick it up from the stack. The same goes for globals(): it executes in the context of the writeToValues() function, so it won't be picking up the globals() of the caller. To remedy that, you can use the inspect module:
import inspect
import os

def writeToValues(name):
    caller = inspect.getmodule(inspect.stack()[1][0])
    caller_globals = caller.__dict__  # use this instead of globals()
    fileName = os.path.splitext(os.path.basename(caller.__file__))[0]
    # etc.
This will ensure that you get the name of the module that imported your script and is calling writeToValues() within it.
Keep in mind that this is a very bad idea if you intend to write usable Python files: if your script name has spaces (like in your example), it will write a variable name with spaces, which will result in a syntax error if you try to load the resulting file into a Python interpreter.
Second, why in the name of all things fluffy are you trying to do a reverse lookup to find a variable name? You are aware that:
a = 2
b = 2
ab = 5
writeToValues(b)
will write {"ab": 2}, and not {"b": 2} making it both incorrect in intent (saves the wrong var) as well as in state representation (saves a wrong value), right? You should pass a variable name you want to store/update instead to ensure you're picking up the right property.
The update part is more problematic: you need to update your file, not merely append to it. That means you need to find the line for your current script, remove it, and then write a new dict with the same name in its place. If you don't expect your file to grow to huge proportions (i.e. you're comfortable holding it partially in working memory), you could do that with:
import os
import inspect

def writeToValues(name):
    caller = inspect.getmodule(inspect.stack()[1][0])
    caller_globals = caller.__dict__  # use this instead of globals()
    caller_name = os.path.splitext(os.path.basename(caller.__file__))[0]
    # keep 'mydict' in the caller's namespace so multiple callers can use this
    target_dict = caller_globals['mydict'] = caller_globals.get('mydict', {})
    if name not in caller_globals:  # the updated value no longer exists, remove it
        target_dict.pop(name, None)
    else:
        target_dict[name] = caller_globals[name]
    # update 'values.py':
    # optionally check if you should update at all - if values didn't change there's no need for slow I/O
    with open("values.py", "a+") as f:
        f.seek(0)  # "a+" opens positioned at the end, so rewind before reading
        last_pos = 0  # keep the last non-update position
        while True:
            line = f.readline()  # we need to use readline() for tell() accuracy
            if not line or line.startswith(caller_name):  # break at the matching line or EOF
                break
            last_pos = f.tell()  # new non-update position
        append_data = f.readlines()  # store the rest of the file content in memory, if any
        f.seek(last_pos)  # rewind to the last non-update position
        f.truncate()  # truncate the rest of the file
        f.write("".join((caller_name, " = ", str(target_dict), "\n")))  # write the updated dict
        if append_data:  # write back the rest of the file, if truncated
            f.writelines(append_data)
    return target_dict
Otherwise use a temp file to write everything as you read it, except for the line matching your current script, append the new value for the current script, delete the original and rename the temp file to values.py.
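A rough sketch of that temp-file variant might look like this (the function name and structure are illustrative, not part of the code above):

import os
import tempfile

def rewrite_values(caller_name, target_dict, path="values.py"):
    # Copy every line except the one belonging to this script into a temp file,
    # append the fresh line, then atomically swap the temp file into place.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as tmp, open(path, "a+") as src:
        src.seek(0)  # "a+" opens positioned at the end, so rewind before reading
        for line in src:
            if not line.startswith(caller_name):
                tmp.write(line)
        tmp.write("%s = %r\n" % (caller_name, target_dict))
    os.replace(tmp_path, path)  # atomic on the same filesystem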
So now if you store the writeToValues() function above in, say, value_writter.py and use it in your script my_script.py as:
import value_writter

a = 2
b = 3

value_writter.writeToValues("a")
value_writter.writeToValues("b")

a = 5
value_writter.writeToValues("a")

# values.py contains: my_script = {"a": 5, "b": 3}
Same should go for any script you import it to. Now, having multiple scripts edit the same file without a locking mechanism is an accident waiting to happen, but that's a whole other story.
Also, if your values are complex, the system will break (or rather, the printout of your dict will not look right). Do yourself a favor and use some proper serialization; even the horrible pickle is better than this.
I want to write an array and a dictionary to a file (and possible more), and then be able to read the file later and recreate the array and dictionary from the file. Is there a reasonable way to do this in Python?
I recommend you use shelve (it comes with Python). For example:
import shelve
d = shelve.open('file.txt') # in this file you will save your variables
d['mylist'] = [1, 2, 'a']  # that's all, but note the name for later
d['mydict'] = {'a':1, 'b':2}
d.close()
To read values:
import shelve
d = shelve.open('file.txt')
my_list = d['mylist'] # the list is read from disk
my_dict = d['mydict'] # the dict is read from disk
If you are going to be saving numpy arrays then I recommend you use joblib which is optimized for this use case.
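For example, a minimal joblib sketch (assuming joblib is installed; the file name is arbitrary):

import numpy as np
from joblib import dump, load

arr = np.arange(1_000_000)
dump(arr, 'my_array.joblib')       # efficient for large numpy arrays
restored = load('my_array.joblib')
assert (arr == restored).all()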
Pickle would be one way to go about it (it is in the standard library).
import pickle
my_dict = {'a':1, 'b':2}
# write to file
pickle.dump(my_dict, open('./my_dict.pkl', 'wb'))
#load from file
my_dict = pickle.load(open('./my_dict.pkl', 'rb'))
And for the array you can use the ndarray.dump() method in numpy, which is more efficient for large arrays.
import numpy as np

my_ary = np.array([[1, 2], [3, 4]])
my_ary.dump('./my_ary.pkl')
But you can of course also put everything into the same pickle file or use shelve (which uses pickle) like suggested in the other answer.
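For instance, a small sketch of keeping both objects in a single pickle file:

import pickle

# Bundle several objects in one container and pickle that once.
with open('everything.pkl', 'wb') as f:
    pickle.dump({'array': my_ary, 'dict': my_dict}, f)

with open('everything.pkl', 'rb') as f:
    data = pickle.load(f)  # access as data['array'] and data['dict']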
The pickle format blurs the line between data and code and I don't like using it except when I'm the only writer and reader of the data in question and when I'm sure that it's not been tampered with.
If your data structure consists of just plain types such as dicts, lists, strings and numbers, you can serialise it into JSON using the json module. This is a pure data format which can be read back reliably. It doesn't handle tuples, though; it treats them as lists.
Here's an example.
a = [1, 2, 3, 4]
b = dict(lang="python", author="Guido")

import json
with open("data.dump", "w") as f:
    json.dump([a, b], f)

with open("data.dump", "r") as f:
    x, y = json.load(f)

print(x)  # => [1, 2, 3, 4]
print(y)  # => {'lang': 'python', 'author': 'Guido'}
It's not totally unchanged but often good enough.
Is it possible to iterate over a list using mmap file?
The point is that the list is too big (over 3,000,000 items). I need fast access to this list when I start the program, so I can't load it into memory after startup because that takes several seconds.
import mmap

with open('list', 'rb') as f:
    # As far as I'm concerned, now I have the list mapped into virtual memory.
    mmapList = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
Now, I want to iterate over this list.
for a in mmapList does not work.
EDIT: The only way I know is to save the list items as rows in a txt file and then use readline, but I'm curious whether there is a better and faster way.
You don't need to use mmap to iterate through the pickled list. All you need to do is, instead of pickling the whole list, pickle and dump each element, then read them back one by one from the file (you can use a generator for that).
Code:
import pickle

def unpickle_iter(f):
    while True:
        try:
            obj = pickle.load(f)
        except EOFError:
            break
        yield obj

def save_list(items, path):
    with open(path, 'wb') as f:
        for i in items:
            pickle.dump(i, f)

def load_list(path):
    with open(path, 'rb') as f:
        # here is your nice "for a in mmaplist" equivalent:
        for obj in unpickle_iter(f):
            print('Loaded object:', obj)

save_list([1, 2, 'hello world!', dict()], 'test-pickle.dat')
load_list('test-pickle.dat')
Output:
Loaded object: 1
Loaded object: 2
Loaded object: hello world!
Loaded object: {}
I've received an error from the csv module when trying to import a .csv file in which a field exceeded 131,072 characters. The csv module happily exports files with fields that large; it just fails when reading them back. It's the dictionary's values that are massive; my keys are small. Do I need a different file format to store dictionaries with huge values?
I use csv throughout my program, and using it consistently is convenient. If more than one format is unavoidable, what is a good alternative? I'd like to store values which could be thousands to millions of characters in length.
Here's the error message
dictionary = e.csv_import(filename)
File "D:\Matt\Documents\Projects\Python\Project 17\e.py", line 8, in csv_import
for key, value in csv.reader(open(filename)):
_csv.Error: field larger than field limit (131072)
Here's my code
def csv_import(filename):
    dictionary = {}
    for key, value in csv.reader(open(filename)):
        dictionary[key] = value
    return dictionary

def csv_export(dictionary, filename):
    csv_file = csv.writer(open(filename, "w"))
    for key, value in dictionary.items():
        csv_file.writerow([key, value])
If you're looking for an alternative, you should probably just use pickle. It's much faster, and much easier than converting from and to a .csv file.
e.g.
with open(filename, 'rb') as f:
    dictionary = pickle.load(f)
and
with open(filename, 'wb') as f:
    pickle.dump(dictionary, f)
One downside is that it's not easily read by other languages (if that's a consideration)
You can adjust the maximum field size via:
>>> import csv
>>> csv.field_size_limit()
131072
>>> old_size = csv.field_size_limit(1024*1024)
>>> csv.field_size_limit()
1048576
For alternatives see below.
You want a persistent dictionary so you could use the shelve module.
import shelve

# open a shelf and write a large value
shelf = shelve.open(filename)
shelf['a'] = 'b' * 200000
shelf.close()

# read it back in
shelf = shelve.open(filename)
print(len(shelf['a']))  # 200000
Under the hood it's using pickle so there are compatibility issues if you wanted to use the shelf file outside of Python. But if compatibility is required, you could use JSON to serialise your dictionary - I assume that the dictionary's values are strings.
import json

def dict_import(filename):
    with open(filename) as f:
        return json.load(f)

def dict_export(dictionary, filename):
    with open(filename, "w") as f:
        json.dump(dictionary, f)
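A quick round-trip with those helpers (a sketch; as noted above, it assumes the values are strings):

big = {'key': 'x' * 200000}
dict_export(big, 'big_values.json')
assert dict_import('big_values.json') == big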