I hope the question is not too unspecific: I have a huge database-like list (~900,000 entries) which I want to use for processing text files. (More details below.) Since this list will be edited and used with other programs as well, I would prefer to keep it in one separate file and include it in the Python code, either directly or by dumping it to some format that Python can use. I was wondering if you can advise on what would be the quickest and most efficient way. I have looked at several options, but may not have seen what is best:
Include the list as a python dictionary in the form
my_list = { "key": "value" }
directly into my python code.
Dump the list to an sqlite database and use the sqlite3 module.
Have the list as a yml file and use the yaml module.
Any ideas how these approaches would scale if I process a text file and want to do replacements on something like 30,000 lines?
For those interested: this is for linguistic processing, in particular ancient Greek. The list is an exhaustive list of Greek forms and the head words that they are derived from. For every word form in a text file, I want to add the dictionary head word.
Point 1 is much faster than using either YAML or SQL, as #b4hand and #DeepSpace indicated. What you should do, though, is not include the list in the rest of the Python code you are developing, as you indicated, but make a separate .py file with just that dictionary definition.
That way the list in that file is easier to write from a program (or extend by a program). And, on first import, a .pyc will be created, which speeds up re-reading on further runs of your program. This is actually very similar in performance to using the pickle module and pickling the dictionary to a file and reading it back from there, while keeping the dictionary in an easily human-readable and editable form.
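For illustration, a minimal sketch of that layout; the module name greek_forms.py, the variable name FORMS and the transliterated example entries are all made up:

# greek_forms.py -- a generated module containing nothing but the lookup dict
FORMS = {
    "logou": "logos",
    "logois": "logos",
    # ... the remaining ~900,000 form -> head word entries, written out by
    # whatever program maintains the list ...
}

# main program: "from greek_forms import FORMS" (the first import also writes a
# cached .pyc, so later runs reload the table quickly)
def annotate(line):
    # append the head word after every form we recognise
    return " ".join(f"{w}[{FORMS[w]}]" if w in FORMS else w for w in line.split())

print(annotate("logou kai logois"))   # -> "logou[logos] kai logois[logos]"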
Less than one million entries is not huge and should fit in memory easily. Thus, your best bet is option 1.
If you are looking for speed, option 1 should be the fastest because the other 2 will need to repeatedly access the HD which will be the bottleneck.
I would use a caching mechanism to hold this data or maybe a data structure storage like redis. Loading all of this in memory might become too expensive.
I'm writing something that essentially refines and reports various strings out of an enormous python dictionary (the source file for the dictionary is XML over a million lines long).
I found mongodb yesterday and was delighted to see that it accepts python dictionaries easy as you please... until it refused mine because the dict object is larger than the BSON size limit of 16MB.
I looked at GridFS for a sec, but that won't accept any python object that doesn't have a .read attribute.
Over time, this program will acquire many of these mega dictionaries; I'd like to dump each into a database so that at some point I can compare values between them.
What's the best way to handle this? I'm awfully new to all of this but that's fine with me :) It seems that a NoSQL approach is best; the structure of these is generally known but can change without notice. Schemas would be nightmarish here.
Have you considered using Pandas? Pandas does not natively accept XML, but if you use ElementTree from the xml standard library you should be able to read it into a Pandas DataFrame and do what you need with it, including refining strings and adding more data to the DataFrame as you get it.
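A minimal sketch of that idea; the file name data.xml and the assumption that each child of the root element is one record are guesses about your layout:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse("data.xml")                           # placeholder file name
rows = [{field.tag: field.text for field in record}   # one dict per record element
        for record in tree.getroot()]
df = pd.DataFrame(rows)                               # refine strings / add columns from here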
So I've decided that this problem is more of a data-design problem than a Python situation. I'm trying to load a lot of unstructured data into a database when I probably only need 10% of it. I've decided to save the refined XML dictionary as a pickle on a shared filesystem for cold storage and use Mongo to store the refined queries I want from the dictionary.
That'll reduce their size from 22MB to 100K.
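A rough sketch of that split, assuming pymongo on the Mongo side; the paths, database and collection names, and the placeholder data are all invented:

import pickle
from pymongo import MongoClient

refined_dict = {"some_key": "lots of refined XML data"}        # stands in for the full ~22 MB dict
query_results = {"run": "nightly", "top_strings": ["a", "b"]}  # the ~100 K of output you care about

# cold storage: the whole dictionary as a pickle on the shared filesystem
with open("refined.pkl", "wb") as f:
    pickle.dump(refined_dict, f)

# hot storage: only the small refined queries go into Mongo
client = MongoClient()
client["reports"]["queries"].insert_one(query_results)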
Thanks for chatting with me about this :)
I have main.py, header.py and var.py
header.py
import var
class table():
    def __init__(self, name):
        self.name = name
var.py
month = "jen"
table = "" # tried to make empty container which can save table instance but don't know how
main.py
import header
import var
var.table = header.table(var.month)
var.month = "feb"
And after this program ends, I want var.table and var.month to be modified and saved in var.py.
When your program ends, all your values are lost—unless you save them first, and load them on the next run. There are a variety of different ways to do this; which one you want depends on what kind of data you have and what you're doing with it.
The one thing you never, ever want to do is print arbitrary objects to a file and then try to figure out how to parse them later. If the answer to any of your questions is ast.literal_eval, you're saving things wrong.
One important thing to consider is when you save. If someone quits your program with ^C, and you only save during clean shutdowns, all your changes are gone.
Numpy/Pandas
Numpy and Pandas have their own built-in functions for saving data. See the Numpy docs and Pandas docs for all of the options, but the basic choices are (there is a short save/load sketch after the list):
Text (e.g., np.savetxt): Portable formats, editable in a spreadsheet.
Binary (e.g., np.save): Small files, fast saving and loading.
Pickle (see below, but also builtin functions): Can save arrays with arbitrary Python objects.
HDF5. If you need HDF5 or NetCDF, you probably already know that you need it.
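A minimal sketch of the text and binary options, with a made-up array:

import numpy as np

a = np.arange(12.0).reshape(3, 4)

np.savetxt("data.txt", a)      # portable, human-editable text
b = np.loadtxt("data.txt")

np.save("data.npy", a)         # compact, fast binary .npy file
c = np.load("data.npy")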
List of strings
If all you have is a list of single-line strings, you just write them to a file and read them back line by line. It's hard to get simpler, and it's obviously human-readable.
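A minimal sketch, assuming one value per line and a placeholder file name:

values = ["alpha", "beta", "gamma"]

with open("values.txt", "w") as f:        # save: one string per line
    f.write("\n".join(values) + "\n")

with open("values.txt") as f:             # load: strip the newlines back off
    loaded = [line.rstrip("\n") for line in f]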
If you need a short name for each value, or need separate sections, but your values are still all simple strings, you may want to look at configparser for CFG/INI files. But as soon as you get more complicated than that, look for a different format.
Python source
If you don't need to save anything, only load data (that your users might want to edit), you can use Python itself as a format—either a module that you import, or a script file that you exec. This can of course be very dangerous, but for a config file that's only being edited by people who already have your entire source code on their computer, that may not be a problem.
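A small sketch of the exec variant; settings.py and its contents are placeholders, and the sketch writes the file itself just so it runs end to end:

from pathlib import Path

# pretend this is a user-edited config file that happens to be Python
Path("settings.py").write_text('greeting = "hello"\nretries = 3\n')

config = {}
exec(Path("settings.py").read_text(), config)   # never do this with untrusted input
print(config["greeting"], config["retries"])    # -> hello 3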
JSON and friends
JSON can save a single dict or list to a file and load it back. JSON is built into the Python standard library, and most other languages can also load and save it. JSON files are human-editable, although not beautiful.
JSON dicts and lists can be nested structures with other dicts and lists inside, and can also contain strings, floats, bools, and None, but nothing else. You can extend the json library with converters for other types, but it's a bit of work.
YAML is (almost) a superset of JSON that's easier to extend, and allows for prettier human-editable files. It doesn't have builtin support in the standard library, but there are a number of solid libraries on PyPI, like ruamel.yaml.
Both JSON and YAML can only save one dict or list per file. (The library will let you save multiple objects, but you won't be able to load them back, so be careful.) The simplest way around this is to create one big dict or list with all of your data packed into it. But JSON Lines allows you to save multiple JSON dicts in a single file, at the cost of human readability. You can load it just by for line in file: obj = json.loads(line), and you can save it with just the standard library if you know what you're doing, but you can also find third-party libraries like json-lines to do it for you.
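A minimal sketch of both plain JSON and JSON Lines, with made-up records and file names:

import json

record = {"word": "logou", "lemma": "logos", "count": 3}

# one dict per file
with open("record.json", "w") as f:
    json.dump(record, f)
with open("record.json") as f:
    loaded = json.load(f)

# many dicts per file (JSON Lines): one json.dumps per line
records = [record, {"word": "kai", "lemma": "kai", "count": 7}]
with open("records.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
with open("records.jsonl") as f:
    loaded_all = [json.loads(line) for line in f]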
Key-value stores
If what you want to store fits into a dict, but you want to have it on disk all the time instead of explicitly saving and loading, you want a key-value store.
dbm is an old but still functional format, as long as your keys and values are all small-ish strings and you don't have tons of them. Python makes a dbm look like a dict, so you don't need to change most of your code at all.
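A minimal dbm sketch; the file name is a placeholder, and keys and values have to be small strings (they come back as bytes):

import dbm

with dbm.open("cache", "c") as db:     # "c" creates the file if it doesn't exist
    db["greeting"] = "hello"
    print(db["greeting"])              # b'hello'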
shelve extends dbm to let you save arbitrary values instead of just strings. It does this by using Pickle (see below), meaning it has the same safety issues, and it can also be slow.
More powerful key-value stores (and related things) are generally called NoSQL databases. There are lots of them nowadays; Redis is one of the popular choices. There's more to learn, but it can be worth it.
CSV
CSV stands for "comma-separated values", although there are variations that use whitespace or other characters. CSV is built into the standard library.
It's a great format when you have a list of objects all with the same fields, as long as all of the members are strings or numbers. But don't try to stretch it beyond that.
CSV files are just barely human-editable as text—but they can be edited very easily in spreadsheet programs like Excel or Google Sheets.
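A minimal csv sketch for a list of records that all have the same fields; the file name and fields are made up:

import csv

rows = [{"name": "Alice", "score": 10}, {"name": "Bob", "score": 7}]

with open("scores.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "score"])
    writer.writeheader()
    writer.writerows(rows)

with open("scores.csv", newline="") as f:
    loaded = list(csv.DictReader(f))   # note: every value comes back as a string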
Pickle
Pickle is designed to save and load just about anything. This can be dangerous if you're reading arbitrary pickle files supplied by users, but it can also be very convenient. Pickle actually can't quite save and load everything unless you do a lot of work to add support to some of your types, but there's a third-party library named dill that extends support a lot further.
Pickle files are not at all human-readable, and are only compatible with Python, and sometimes not even with older versions of Python.
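A minimal pickle sketch (only ever unpickle files you trust); the data is made up:

import pickle

state = {"month": "feb", "counts": [1, 2, 3]}

with open("state.pkl", "wb") as f:
    pickle.dump(state, f)

with open("state.pkl", "rb") as f:
    restored = pickle.load(f)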
SQL
Finally, you can always build a full relational database. It's not quite as scary as it sounds.
Python has a database called sqlite3 built into the standard library.
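A minimal sqlite3 sketch; the table and column names are made up:

import sqlite3

conn = sqlite3.connect("forms.db")
conn.execute("CREATE TABLE IF NOT EXISTS forms (form TEXT PRIMARY KEY, lemma TEXT)")
conn.execute("INSERT OR REPLACE INTO forms VALUES (?, ?)", ("logou", "logos"))
conn.commit()

row = conn.execute("SELECT lemma FROM forms WHERE form = ?", ("logou",)).fetchone()
print(row[0])                          # -> logos
conn.close()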
If that looks too complicated, you may want to consider SQLAlchemy, which lets you store and query data without having to learn the SQL language. Or, if you search around, there are a number of fancier ORMs, and libraries that let you run custom list comprehensions directly against databases, and so on.
Other formats
There are zillions of other standards out there for data files; a few even come with support in the standard library. They can be useful for special cases—plist files match what Apple uses for preferences on macOS and iOS; netrc files are a long-established way to store a list of server logins; XML is perfect if you have a time machine that can only travel to the year 2000; etc. But usually, you're better off using one of the common formats mentioned above.
I have a CSV file which is about 1 GB and contains about 50 million rows of data. I am wondering whether it is better to keep it as a CSV file or store it as some form of database. I don't know a great deal about MySQL to argue for why I should use it or another database framework over just keeping it as a CSV file. I am basically doing a breadth-first search with this dataset, so once I get the initial "seed" set from the 50 million rows, I use these as the first values in my queue.
Thanks,
I would say that there are a wide variety of benefits to using a database over a CSV for such large structured data, so I would suggest that you learn enough to do so. However, based on your description you might want to check out non-server/lighter-weight databases, such as SQLite or something similar to JavaDB/Derby... or, depending on the structure of your data, a non-relational (NoSQL) database; obviously you will need one with some type of Python support, though.
If you want to search on something graph-ish (since you mention Breadth-First Search) then a graph database might prove useful.
Are you just going to slurp in everything all at once? If so, then CSV is probably the way to go. It's simple and works.
If you need to do lookups, then something that lets you index the data, like MySQL, would be better.
From your previous questions, it looks like you are doing social-network searches against facebook friend data; so I presume your data is a set of 'A is-friend-of B' statements, and you are looking for a shortest connection between two individuals?
If you have enough memory, I would suggest parsing your csv file into a dictionary of lists. See Can this breadth-first search be made faster?
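A rough sketch of that approach, assuming each CSV row is a pair A,B meaning "A is a friend of B"; the file name friends.csv is a placeholder for your real data:

import csv
from collections import defaultdict, deque

# adjacency list: person -> list of friends
graph = defaultdict(list)
with open("friends.csv", newline="") as f:
    for a, b in csv.reader(f):
        graph[a].append(b)
        graph[b].append(a)

def shortest_chain(start, goal):
    # breadth-first search for the shortest chain of friendships
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        person, dist = queue.popleft()
        if person == goal:
            return dist
        for friend in graph[person]:
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return None                        # no connection found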
If you cannot hold all the data at once, a local-storage database like SQLite is probably your next-best alternative.
There are also some python modules which might help:
graph-tool http://projects.skewed.de/graph-tool/
python-graph http://pypi.python.org/pypi/python-graph/1.8.0
networkx http://networkx.lanl.gov/
igraph http://igraph.sourceforge.net/
How about a key-value store like MongoDB?
I want to create a very very large dictionary, and I'd like to store it on disk so as not to kill my memory. Basically, my needs are a cross between cPickle and the dict class, in that it's a class that Python treats like a dictionary, but happens to live on the disk.
My first thought was to create some sort of wrapper around a simple MySQL table, but I have to store types in the entries of the structure that MySQL can't even hope to support out of the box.
The simplest way is the shelve module, which works almost exactly like a dictionary:
import shelve
myshelf = shelve.open("filename")  # Might turn into filename.db
myshelf["A"] = "First letter of alphabet"
print(myshelf["A"])
# ...
myshelf.close()  # You should do this explicitly when you're finished
Note the caveats in the module documentation about changing mutable values (lists, dicts, etc.) stored on a shelf (you can, but it takes a bit more fiddling). It uses (c)pickle and dbm under the hood, so it will cheerfully store anything you can pickle.
I don't know how well it performs relative to other solutions, but it doesn't require any custom code or third party libraries.
Look at dbm specifically, and more generally the entire Data Persistence chapter in the manual. Most key/value-store databases (gdbm, bdb, metakit, etc.) have a dict-like API which would probably serve your needs (and are fully embeddable, so there is no need to manage an external database process).
File IO is expensive in terms of CPU cycles. So my first thoughts would be in favor of a database.
However, you could also split your "English dictionary" across multiple files so that (say) each file holds words that start with a specific letter of the alphabet (therefore, you'll have 26 files).
Now, when you say I want to create a very very large dictionary, do you mean a python dict or an English dictionary with words and their definitions, stored in a dict (with words as keys and definitions as values)? The second can be easily implemented with cPickle, as you pointed out.
Again, if memory is your main concern, then you'll need to recheck the number of files you want to use, because if you're pickling dicts into each file, you want the dicts not to get too big.
Perhaps a usable solution for you would be to do this (I am going to assume that all the English words are sorted):
Get all the words in the English language into one file.
Count how many such words there are and split them into as many files as you see fit, depending on how large the files get.
Now, these smaller files contain the words and their meanings.
This is how this solution is useful:
Say that your problem is to lookup the definition of a particular word. Now, at runtime, you can read the first word in each file, and determine if the word that you are looking for is in the previous file that you read (you will need a loop counter to check if you are at the last file). Once you have determined which file the word you are looking for is in, then you can open that file and load the contents of that file into the dict.
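A rough sketch of that lookup; it fakes two tiny chunk files so it runs on its own, and uses bisect to pick the right file instead of a manual loop counter:

import bisect
import pickle

# pretend the sorted word list was already split into pickled chunk files
chunks = [{"aardvark": "an animal", "apple": "a fruit"},
          {"zebra": "another animal"}]
chunk_files = []
for i, chunk in enumerate(chunks):
    name = f"words_{i:03d}.pkl"
    with open(name, "wb") as f:
        pickle.dump(chunk, f)
    chunk_files.append(name)

# first_words[i] is the alphabetically first word stored in chunk file i
first_words = [min(c) for c in chunks]

def lookup(word):
    # pick the last chunk whose first word is <= word, and load only that file
    i = max(bisect.bisect_right(first_words, word) - 1, 0)
    with open(chunk_files[i], "rb") as f:
        return pickle.load(f).get(word)

print(lookup("apple"))                 # -> a fruit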
It's a little difficult to offer a solution without knowing more details about the problem at hand.
In my program I have a method which requires about 4 files to be opened each time it is called, since I need to take some data from them. All of this data from the files I have been storing in lists for manipulation.
I need to call this method approximately 10,000 times, which is making my program very slow.
Is there a better way of handling these files? And is storing all the data in lists time-consuming; what are better alternatives to lists?
I could give some code, but my previous question was closed because the code only confused everyone, as it is part of a big program that would need to be explained completely to be understood. So I am not giving any code; please suggest approaches, treating this as a general question.
Thanks in advance.
As a general strategy, it's best to keep this data in an in-memory cache if it's static, and relatively small. Then, the 10k calls will read an in-memory cache rather than a file. Much faster.
If you are modifying the data, the alternative might be a database like SQLite, or embedded MS SQL Server (and there are others, too!).
It's not clear what kind of data this is. Is it simple config/properties data? Sometimes you can find libraries to handle the loading/manipulation/storage of this data, and they usually have their own internal in-memory cache; all you need to do is call one or two functions.
Without more information about the files (how big are they?) and the data (how is it formatted and structured?), it's hard to say more.
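A minimal sketch of that in-memory caching idea using functools.lru_cache, so each file is read and parsed only once no matter how many of the 10,000 calls ask for it; the file name is a placeholder:

from functools import lru_cache

@lru_cache(maxsize=None)
def load_data(path):
    # first call reads and parses the file; later calls return the cached tuple
    with open(path) as f:
        return tuple(line.rstrip("\n") for line in f)

def my_method(x):
    data = load_data("lookup1.txt")    # hits the cache, not the disk, after the first call
    return x in data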
Opening, closing, and reading a file 10,000 times is always going to be slow. Can you open the file once, do 10,000 operations on the list, then close the file once?
It might be better to load your data into a database and put some indexes on the database. Then it will be very fast to make simple queries against your data. You don't need a lot of work to set up a database. You can create an SQLite database without requiring a separate process and it doesn't have a complicated installation process.
Open the files in the method that calls the one you want to run, and pass the data as parameters to that method.
If the files are structured, like configuration files, it might be good to use the configparser library; otherwise, if they have some other structured format, I think it would be better to store all this data in JSON or XML and perform any necessary operations on that data.