How to store a dictionary in a file? - python

I'm rather new to Python and coding in general.
I'm writing my own chat statistics bot for a Russian social network (vk.com).
My question is: can I store a dictionary in a file and work with it directly?
For example:
Userlist = open('userlist.txt', 'r+')
if lastmessage['uid'] not in Userlist.read():
    Userlist.read()[lastmessage['uid']] = 1
Userlist.close()
Or do I have to use some other module like json?
Thank you

(Amended answer in light of a clarifying comment: "in the while True loop I want to check if a user's id is in the 'userlist' dictionary (as a key) and, if not, add it to the dictionary with the value 1. Then I want to rewrite the file with the new dictionary. The file is opened as soon as the program is launched, before the loop."):
For robustly using data on disk as though it were a dictionary you should consider either one of the dbm modules or just using the SQLite3 support.
A dbm file is simply a set of keys and values stored with transparently maintained and used indexing. Once you've opened your dbm file you use it exactly like any other Python dictionary (with strings as keys), and any changes are flushed and written before closing the file. This is very simple, though it offers no special features for locking (or managing consistency in cases where multiple processes might write to the file concurrently) and so on.
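A minimal sketch of that idea, adapted to the asker's scenario (the uid value here is hypothetical):
import dbm

# open (or create) the database file; keys and values must be str or bytes
with dbm.open('userlist', 'c') as db:
    uid = '12345'  # hypothetical user id, e.g. from lastmessage['uid']
    if uid not in db:
        db[uid] = '1'  # dbm stores strings/bytes, so keep the count as text
    else:
        db[uid] = str(int(db[uid]) + 1)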
On the other hand the incredibly powerful SQLite subsystem, which has been included in the Python standard library for many years, allows you to easily treat a local file as an SQL database management system ... with all of the features you'd expect from a client/server based system (foreign keys, data type and referential integrity constraints, views and triggers, indexes, etc.).
In your case you could simply have a single table containing a single column. Binding to that database (by its filename) would allow you to query for a user's name with SELECT and add the user's name with INSERT. As your application grows and changes you could add other columns to track when the account was created and when it was most recently used or checked (a couple of time/date stamp columns) and you could create other tables with related data (selected using JOINs, for example).
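A minimal sqlite3 sketch of that single-column table (the table, column, and file names are illustrative):
import sqlite3

conn = sqlite3.connect('userlist.db')  # the file is created on first use
conn.execute('CREATE TABLE IF NOT EXISTS users (uid TEXT PRIMARY KEY)')

def user_exists(uid):
    row = conn.execute('SELECT 1 FROM users WHERE uid = ?', (uid,)).fetchone()
    return row is not None

def add_user(uid):
    # INSERT OR IGNORE leaves the row alone if the uid is already there
    conn.execute('INSERT OR IGNORE INTO users (uid) VALUES (?)', (uid,))
    conn.commit()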
(Original answer):
In general the process of storing any internal data structure to a file, or transmitting it over a network connection, is referred to as "serialization." The complementary process of loading or receiving such data and instantiating its contents as a new data structure is referred to (unsurprisingly) as "deserialization."
That's true of all programming languages.
There are many ways to serialize and deserialize data in Python. In particular we have the native (standard library) pickle module, which produces files (or strings) that are only intended for use by other processes running Python, or we can, as you said, use JSON ... the JavaScript Object Notation, which has become the de facto cross-language data-serialization standard. (There are others, such as YAML and XML ... but JSON has come to predominate.)
The caveat about using JSON vs. pickle is that JavaScript (and a number of other programming and scripting languages) uses different semantics for some sorts of "dictionary" (associative array) keys than Python. In particular, Python (and Ruby and Lua) treats keys such as "1" (a string containing the digit one) and 1 or 1.0 (numeric values equal to one) as distinct keys. JavaScript, Perl and some others treat keys as "scalar" values, in which strings like "1" and the number 1 evaluate to the same key.
There are some other nuances which can affect the fidelity of your serialization, but that's the easiest to understand. Dictionaries with strings as keys are fine ... mixtures of numeric and string keys are the most likely cause of any trouble you'll encounter using JSON serialization/deserialization in lieu of pickling.
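You can see that key coercion, and pickle's fidelity, in a quick round trip:
import json
import pickle

d = {1: 'one', '1': 'string one'}  # two distinct keys in Python

s = json.dumps(d)      # JSON object keys must be strings, so 1 becomes "1"
print(s)               # {"1": "one", "1": "string one"} -- a key collision
print(json.loads(s))   # {'1': 'string one'} -- only one key survives

print(pickle.loads(pickle.dumps(d)))  # {1: 'one', '1': 'string one'} -- intact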

Related

Caching a large data structure in python across instances while maintaining types

I'm looking to use a distributed cache in Python. I have a FastAPI application and want every instance to have access to the same data, as our load balancer may route incoming requests differently. The problem is that I'm storing / editing information about a relatively big data set from an Arrow feather file and processing it with Vaex. The feather file automatically loads the correct types for the data. The data structure I need to store will use a user id as a key and the value will be a large array of arrays of numbers. I've looked at memcached and redis as possible caching solutions, but both seem to store entries as strings / simple values. I'm looking to avoid parsing strings and extra processing on a large amount of data. Is there a distributed caching strategy that will let me persist types?
One solution we came up with is to store the data in multiple feather files in a directory that is accessible to all instances of the app, but this seems messy, as you would need to clean up / delete the files after each session.
Redis 'strings' are actually able to store arbitrary binary data; they aren't limited to actual strings. From https://redis.io/topics/data-types:
Redis Strings are binary safe, this means that a Redis string can contain any kind of data, for instance a JPEG image or a serialized Ruby object.
A String value can be at max 512 Megabytes in length.
Another option is to use FlatBuffers, which is a serialisation protocol specifically designed to allow reading/writing serialised objects without expensive deserialisation.
Although I would suggest reconsidering storing large, complex data structures as cache values. The drawback is that any change means rewriting the entire value in the cache, which can get expensive, so consider breaking it up into smaller k/v pairs if possible. You could use the Redis Hash data type to make this easier to implement.
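As a sketch of the binary-safe approach, assuming the redis-py client and a Redis server on localhost (the key name and data are illustrative):
import pickle
import redis

r = redis.Redis(host='localhost', port=6379)

arrays = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # stand-in for the per-user data

r.set('user:42', pickle.dumps(arrays))     # bytes are stored as-is
restored = pickle.loads(r.get('user:42'))  # types survive the round trip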

How to update a saved mapping without loading it into memory?

I'm maintaining some mappings that I need to continually update.
These mappings are implemented as pickle serialized dicts right now.
The update process is like this:
Load the pickle file into memory, so that I have access to the dict
Do any update to the dict and serialize it again.
The problem with this solution is it could consume a lot of memory for large dicts.
I've looked into other solutions like shelve and leveldb, but they can both generate many files instead of one, which is more complex to save to systems like key-value storage.
To read and modify your mappings without reading the entire map into memory, you'll need to store it as an indexed structure in some sort of database. There are lots of databases with good Python bindings that store the data on disk as a single file, so you don't have to worry about database servers or separate index files. SQLite is almost certainly the most common choice. However, as you pointed out in the comments, the full functionality of an SQL database is probably unnecessary for your purpose, since you really only need to store key-value pairs.
Based on your particular requirements, then, I'd probably recommend vedis. It's a single-file, key-value database which can support very large database sizes (the documentation claims it can handle on the order of terabytes), and which is transactional and thread-safe to boot.
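If you'd rather stay in the standard library, a minimal sqlite3 key-value sketch of the same idea looks like this; each update touches only the affected row, so the whole mapping never has to be loaded into memory (the table and file names are illustrative):
import pickle
import sqlite3

conn = sqlite3.connect('mapping.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)')

def put(key, value):
    # pickle the value so arbitrary Python objects can be stored as blobs
    blob = pickle.dumps(value)
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', (key, blob))
    conn.commit()

def get(key):
    row = conn.execute('SELECT value FROM kv WHERE key = ?', (key,)).fetchone()
    return pickle.loads(row[0]) if row else None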

How to modify variables and instances in modules and save it at runtime in Python

I have main.py,header.py and var.py
header.py
import var
class table():
    def __init__(self, name):
        self.name = name
var.py
month = "jen"
table = "" # tried to make empty container which can save table instance but don't know how
main.py
import header
import var
var.table = header.table(var.month)
var.month = "feb"
And after the program ends, I want var.table and var.month to keep their modified values, saved back into var.py.
When your program ends, all your values are lost—unless you save them first, and load them on the next run. There are a variety of different ways to do this; which one you want depends on what kind of data you have and what you're doing with it.
The one thing you never, ever want to do is print arbitrary objects to a file and then try to figure out how to parse them later. If the answer to any of your questions is ast.literal_eval, you're saving things wrong.
One important thing to consider is when you save. If someone quits your program with ^C, and you only save during clean shutdowns, all your changes are gone.
Numpy/Pandas
Numpy and Pandas have their own built-in functions for saving data. See the Numpy docs and Pandas docs for all of the options, but the basic choices are:
Text (e.g., np.savetxt): Portable formats, editable in a spreadsheet.
Binary (e.g., np.save): Small files, fast saving and loading.
Pickle (see below, but also builtin functions): Can save arrays with arbitrary Python objects.
HDF5. If you need HDF5 or NetCDF, you probably already know that you need it.
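A quick sketch of the text and binary options:
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)

np.savetxt('a.txt', a)  # portable, human-readable text
np.save('a.npy', a)     # compact binary .npy file

b = np.load('a.npy')    # dtype and shape round-trip exactly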
List of strings
If all you have is a list of single-line strings, you just write them to a file and read them back line by line. It's hard to get simpler, and it's obviously human-readable.
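For example:
names = ['alice', 'bob', 'carol']  # hypothetical values

with open('names.txt', 'w') as f:
    f.write('\n'.join(names) + '\n')

with open('names.txt') as f:
    loaded = [line.rstrip('\n') for line in f]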
If you need a short name for each value, or need separate sections, but your values are still all simple strings, you may want to look at configparser for CFG/INI files. But as soon as you get more complicated than that, look for a different format.
Python source
If you don't need to save anything, only load data (that your users might want to edit), you can use Python itself as a format—either a module that you import, or a script file that you exec. This can of course be very dangerous, but for a config file that's only being edited by people who already have your entire source code on their computer, that may not be a problem.
JSON and friends
JSON can save a single dict or list to a file and load it back. JSON is built into the Python standard library, and most other languages can also load and save it. JSON files are human-editable, although not beautiful.
JSON dicts and lists can be nested structures with other dicts and lists inside, and can also contain strings, floats, bools, and None, but nothing else. You can extend the json library with converters for other types, but it's a bit of work.
YAML is (almost) a superset of JSON that's easier to extend, and allows for prettier human-editable files. It doesn't have builtin support in the standard library, but there are a number of solid libraries on PyPI, like ruamel.yaml.
Both JSON and YAML can only save one dict or list per file. (The library will let you save multiple objects, but you won't be able to load them back, so be careful.) The simplest way around this is to create one big dict or list with all of your data packed into it. But JSON Lines allows you to save multiple JSON dicts in a single file, at the cost of some human readability. You can load it with just for line in file: obj = json.loads(line), and you can save it with just the standard library if you know what you're doing, but you can also find third-party libraries like json-lines to do it for you.
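A minimal JSON Lines round trip using only the standard library:
import json

records = [{'id': 1, 'ok': True}, {'id': 2, 'ok': False}]

with open('records.jsonl', 'w') as f:
    for rec in records:
        f.write(json.dumps(rec) + '\n')  # one JSON object per line

with open('records.jsonl') as f:
    loaded = [json.loads(line) for line in f]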
Key-value stores
If what you want to store fits into a dict, but you want to have it on disk all the time instead of explicitly saving and loading, you want a key-value store.
dbm is an old but still functional format, as long as your keys and values are all small-ish strings and you don't have tons of them. Python makes a dbm look like a dict, so you don't need to change most of your code at all.
shelve extends dbm to let you save arbitrary values instead of just strings. It does this by using Pickle (see below), meaning it has the same safety issues, and it can also be slow.
More powerful key-value stores (and related things) are generally called NoSQL databases. There are lots of them nowadays; Redis is one of the popular choices. There's more to learn, but it can be worth it.
CSV
CSV stands for "comma-separated values", although there are variations that use whitespace or other characters. CSV is built into the standard library.
It's a great format when you have a list of objects all with the same fields, as long as all of the members are strings or numbers. But don't try to stretch it beyond that.
CSV files are just barely human-editable as text—but they can be edited very easily in spreadsheet programs like Excel or Google Sheets.
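A sketch with the standard csv module (the field names are illustrative):
import csv

rows = [{'name': 'alice', 'score': 10}, {'name': 'bob', 'score': 7}]

with open('scores.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'score'])
    writer.writeheader()
    writer.writerows(rows)

with open('scores.csv', newline='') as f:
    loaded = list(csv.DictReader(f))  # note: values come back as strings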
Pickle
Pickle is designed to save and load just about anything. This can be dangerous if you're reading arbitrary pickle files supplied by users, but it can also be very convenient. Pickle actually can't quite save and load everything unless you do a lot of work to add support to some of your types, but there's a third-party library named dill that extends support a lot further.
Pickle files are not at all human-readable, and are only compatible with Python, and sometimes not even with older versions of Python.
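The basic dump/load round trip looks like this:
import pickle

state = {'counts': {1: 3, 'a': [1, 2]}, 'seen': {(0, 1)}}

with open('state.pickle', 'wb') as f:  # pickle files are binary
    pickle.dump(state, f)

with open('state.pickle', 'rb') as f:
    restored = pickle.load(f)          # nested objects and types come back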
SQL
Finally, you can always build a full relational database. This isn't quite as scary as it sounds.
Python has a database called sqlite3 built into the standard library.
If that looks too complicated, you may want to consider SQLAlchemy, which lets you store and query data without having to learn the SQL language. Or, if you search around, there are a number of fancier ORMs, and libraries that let you run custom list comprehensions directly against databases, and so on.
Other formats
There are zillions of other standards out there for data files; a few even come with support in the standard library. They can be useful for special cases—plist files match what Apple uses for preferences on macOS and iOS; netrc files are a long-established way to store a list of server logins; XML is perfect if you have a time machine that can only travel to the year 2000; etc. But usually, you're better off using one of the common formats mentioned above.

Can you permanently change python code by input?

I'm still learning Python and am currently developing an API (artificial personal assistant, e.g. Siri or Cortana). I was wondering if there is a way to update code by input. For example, if I had a list, would it be possible to PERMANENTLY add a new item, even after the program has finished running?
I read that you would have to use SQLite; is that true? And are there other ways?
I think what you want to do is save the input data to a file (e.g. a .txt file).
You can view the link below which will show you how to read and write to a text file.
How to read and write to text file in Python
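Beyond plain text, here is a small json-based sketch that makes a list survive between runs (the file name is illustrative):
import json
import os

items = []
if os.path.exists('items.json'):
    with open('items.json') as f:
        items = json.load(f)  # load whatever earlier runs saved

items.append('new item')      # the "permanent" update

with open('items.json', 'w') as f:
    json.dump(items, f)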
There are plenty of ways to make your data persistent.
It depends on the task, the environment, etc.
Just a couple examples:
Files (JSON, DBM, Pickle)
NoSQL Databases (Redis, MongoDB, etc.)
SQL Databases (both serverless and client-server: SQLite, MySQL, PostgreSQL, etc.)
The most simple/basic approach is to use files.
There are even modules that let you do it transparently.
You just work with your data as always.
See shelve for example.
From the documentation:
A “shelf” is a persistent, dictionary-like object. The difference with “dbm” databases is that the values (not the keys!) in a shelf can be essentially arbitrary Python objects — anything that the pickle module can handle. This includes most class instances, recursive data types, and objects containing lots of shared sub-objects. The keys are ordinary strings.
Example of usage:
import shelve

s = shelve.open('test_shelf.db')
try:
    s['key1'] = {'int': 10, 'float': 9.5, 'string': 'Sample data'}
finally:
    s.close()
You work with s just as you would a normal dictionary,
and it is automatically saved to disk (in the file test_shelf.db in this case).
This way your dictionary is persistent
and will not lose its values when the program restarts.
More on it:
https://docs.python.org/2/library/shelve.html
https://pymotw.com/2/shelve/
Another option is to use pickle, which also gives you persistence,
but not magically: you will need to read and write the data on your own.
Comparison between shelve and pickle:
What is the difference between pickle and shelve?

I need a class that creates a dictionary file that lives on disk

I want to create a very very large dictionary, and I'd like to store it on disk so as not to kill my memory. Basically, my needs are a cross between cPickle and the dict class, in that it's a class that Python treats like a dictionary, but happens to live on the disk.
My first thought was to create some sort of wrapper around a simple MySQL table, but I have to store types in the entries of the structure that MySQL can't even hope to support out of the box.
The simplest way is the shelve module, which works almost exactly like a dictionary:
import shelve

myshelf = shelve.open("filename")  # Might turn into filename.db
myshelf["A"] = "First letter of alphabet"
print(myshelf["A"])
# ...
myshelf.close()  # You should do this explicitly when you're finished
Note the caveats in the module documentation about changing mutable values (lists, dicts, etc.) stored on a shelf (you can, but it takes a bit more fiddling). It uses (c)pickle and dbm under the hood, so it will cheerfully store anything you can pickle.
I don't know how well it performs relative to other solutions, but it doesn't require any custom code or third party libraries.
Look specifically at dbm, and more generally at the entire Data Persistence chapter in the manual. Most key/value-store databases (gdbm, bdb, metakit, etc.) have a dict-like API which would probably serve your needs (and are fully embeddable, so there's no need to manage an external database process).
File IO is expensive in terms of CPU cycles. So my first thoughts would be in favor of a database.
However, you could also split your "English dictionary" across multiple files so that (say) each file holds words that start with a specific letter of the alphabet (therefore, you'll have 26 files).
Now, when you say you want to create a very very large dictionary, do you mean a Python dict or an English dictionary, with words and their definitions stored in a dict (words as keys, definitions as values)? The second can be easily implemented with cPickle, as you pointed out.
Again, if memory is your main concern, then you'll need to reconsider the number of files you want to use, because, if you're pickling a dict into each file, then you want the dicts to not get too big.
Perhaps a usable solution for you would be to do this (I am going to assume that all the English words are sorted):
Get all the words in the English language into one file.
Count how many such words there are and split them into as many files as you see fit, depending on how large the files get.
Now, these smaller files contain the words and their meanings.
This is how this solution is useful:
Say your problem is to look up the definition of a particular word. At runtime you can read the first word in each file and determine whether the word you are looking for is in the previous file you read (you will need a loop counter to check whether you are at the last file). Once you have determined which file the word is in, you can open that file and load its contents into a dict, as in the sketch below.
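A sketch of that lookup, assuming each small file holds a pickled dict and first_words holds the sorted first word of each file (both names are hypothetical):
import bisect
import pickle

def lookup(word, files, first_words):
    # files[i] starts at first_words[i]; pick the last file whose first
    # word is <= the word we want
    i = max(bisect.bisect_right(first_words, word) - 1, 0)
    with open(files[i], 'rb') as f:
        chunk = pickle.load(f)  # the dict for just that slice of words
    return chunk.get(word)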
It's a little difficult to offer a solution without knowing more details about the problem at hand.
