I am new to Python. I need a data structure to store counts of some objects. For example, I want to store the most visited webpages. Let's say I have the 100 most visited webpages and I keep the count of visits to each one. I may need to update the list, and I will definitely update the visit counts. The structure does not have to be ordered. I will look up the associated visit count given the webpage ID. I am planning to use a dictionary. Is there a faster way of doing this in Python?
The dictionary is an appropriate and fast data structure for this task (mapping webpage IDs to visit counts).
Python dictionaries are implemented as hash tables, so lookups take O(1) time on average. They are so fast that almost any attempt to avoid them will make the code run slower and more unpleasant to look at.
P.S. Also take a look at collections.Counter, which is specifically designed for this kind of work (counting hits). It is a dictionary subclass whose missing keys default to zero.
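Here is a minimal sketch of Counter for this use case (the page IDs below are made up for illustration):

    from collections import Counter

    visits = Counter()
    for page_id in ["home", "about", "home", "contact", "home"]:  # illustrative visit stream
        visits[page_id] += 1          # missing keys start at 0

    print(visits["home"])             # -> 3
    print(visits.most_common(1))      # -> [('home', 3)]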
The Python dictionary is one of the most heavily optimized parts of the whole language, and the reason is that dictionaries are used everywhere internally.
For example, every object instance normally uses a dictionary to hold its attributes, a class is essentially a dictionary containing its methods, modules keep their globals in a dictionary, the interpreter uses a dictionary to store and look up loaded modules, and so on.
Keeping a counter in a dictionary is a good approach in Python.
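As an illustration (the page ID below is made up), incrementing counts with a plain dict can look like this:

    visit_counts = {}

    def record_visit(page_id):
        # dict.get returns 0 when the page has not been seen yet
        visit_counts[page_id] = visit_counts.get(page_id, 0) + 1

    record_visit("page-42")
    record_visit("page-42")
    print(visit_counts["page-42"])  # -> 2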
I am using MongoDB with Python. I used the following command to insert my documents:
db.test.insert_one({"Name": name, "Age": age})
(name and age are variables within my code)
I used the following command to sort from oldest age to youngest:
db.test.find().sort("Age", 1)
I understand that I am simply issuing a find command within my code. Is there a clever way to use the db.test.save() method to sort my documents and overwrite the original with the sorted documents?
I looked into many questions that somewhat address this problem such as this:
How to store sorted array back to MongoDB?
But in order to do the $push command inside update_one and use the $sort feature, I would have to re-work my entire code. What are my options? Do I have to re-work my code or is there another way?
You are mixing up two concepts in mongo: sorting an array embedded in a document and sorting documents in a collection by some properties.
The post you link to refers to the former. You seem to be trying to permanently sort the documents in a collection.
Mongo stores documents in roughly the order they are written in. Like most databases it uses indexes to make retrieving data more efficient.
In the instance you describe, you would create an index on the age property of your document.
Use the create_index function (ensure_index is the older, now-deprecated spelling):
db.test.create_index([("Age", pymongo.DESCENDING)])
Henceforth, every query like
db.test.find().sort("Age", -1)
uses the index and will be very fast.
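Roughly, and assuming a local MongoDB plus the collection name from the question (the connection string and database name here are placeholders), that looks like:

    import pymongo

    client = pymongo.MongoClient("mongodb://localhost:27017")  # placeholder connection string
    db = client.mydb                                            # placeholder database name

    # Build the index once; sorted queries on Age can then use it.
    db.test.create_index([("Age", pymongo.DESCENDING)])

    # Oldest to youngest, served efficiently via the index.
    for doc in db.test.find().sort("Age", pymongo.DESCENDING):
        print(doc["Name"], doc["Age"])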
I suppose you could try rewriting the documents to another collection in sorted order and then see whether a simple fetch always comes back in age order, but this really isn't the right way to think about Mongo.
Is there a dictionary in Python which will only keep the most recently accessed keys? Specifically, I am caching relatively large blobs of data in a dictionary, and I am looking for a way to prevent the dictionary from ballooning in size by dropping the entries that were accessed long ago (i.e. keep only, say, the 1000 most recently accessed keys, and when a new key gets added, drop the key that was accessed the longest ago).
I suspect this is not part of the standard dictionary class, but am hoping there is something analogous.
Sounds like you want a Least Recently Used (LRU) cache.
Here's a Python implementation already: https://pypi.python.org/pypi/lru-dict/
Here's another one: https://www.kunxi.org/blog/2014/05/lru-cache-in-python/
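If you'd rather avoid a third-party dependency, here is a rough sketch of the same idea built on collections.OrderedDict from the standard library (the 1000-entry default mirrors the question; everything else is illustrative):

    from collections import OrderedDict

    class LRUDict:
        """A dict-like cache that keeps only the most recently accessed keys."""

        def __init__(self, max_size=1000):
            self.max_size = max_size
            self._data = OrderedDict()

        def __getitem__(self, key):
            value = self._data[key]
            self._data.move_to_end(key)         # mark as most recently used
            return value

        def __setitem__(self, key, value):
            if key in self._data:
                self._data.move_to_end(key)
            self._data[key] = value
            if len(self._data) > self.max_size:
                self._data.popitem(last=False)  # evict the least recently used key

        def __contains__(self, key):
            return key in self._data

        def __len__(self):
            return len(self._data)

    # Usage
    cache = LRUDict(max_size=2)
    cache["a"] = 1
    cache["b"] = 2
    _ = cache["a"]        # touch "a" so it becomes the most recent
    cache["c"] = 3        # evicts "b"
    print("b" in cache)   # -> False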
I have some code that I am working on that scrapes some data from a website and then extracts certain key information from it and stores it in an object. I create a couple hundred of these objects each day, each from a unique URL. This is working quite well; however, I'm inexperienced in what options are available to me in Python for persistence and what would be best suited for my needs.
Currently I am using pickle. To do so, I keep all of these webpage objects in a list, appending new ones as they are created, then saving that list to a pickle (and reloading it whenever the list is to be updated). However, as I'm at the order of some GB of data, I'm finding pickle to be somewhat slow. It's not unworkable, but I'm wondering if there is a better-suited alternative. I don't really want to break apart the structure of my objects and store them in an SQL-type database, as it's important for me to keep the methods and the data together as a single object.
Shelve is one option I've been looking into, as my impression is that I wouldn't have to unpickle and re-pickle all the old entries (just the most recent day that needs to be updated), but I am unsure if this is how shelve works, and how fast it is.
So to avoid rambling on, my question is: what is the preferred persistence method for storing a large number of objects (all of the same type), to keep read/write speed up as the collection grows?
Martijn's suggestion could be one of the alternatives.
You may consider storing the pickled objects directly in an SQLite database, which you can still manage from the Python standard library (the sqlite3 module).
Use pickle.dumps and pickle.loads to convert between the database column (a BLOB) and the Python object.
You didn't mention the size of each object you are pickling now. I guess it should stay well within sqlite's limit.
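A rough sketch of that approach under Python 3, where the pickled bytes go into a BLOB column (the file, table, and column names are made up):

    import pickle
    import sqlite3

    conn = sqlite3.connect("pages.db")  # assumed file name
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, obj BLOB)")

    def save_page(url, page_obj):
        # pickle.dumps gives bytes, which sqlite3 stores in the BLOB column
        blob = pickle.dumps(page_obj, protocol=pickle.HIGHEST_PROTOCOL)
        conn.execute("REPLACE INTO pages (url, obj) VALUES (?, ?)", (url, blob))
        conn.commit()

    def load_page(url):
        row = conn.execute("SELECT obj FROM pages WHERE url = ?", (url,)).fetchone()
        return pickle.loads(row[0]) if row else None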
I would like to have a map data type for one of my entity types in my Python Google App Engine application. I think what I need is essentially the Python dict datatype, where I can create a list of key-value mappings. I don't see any obvious way to do this with the provided datatypes in App Engine.
The reason I'd like to do this is that I have a User entity and I'd like to track within that user a mapping of lessonIds to values that represent that user's status with a particular lesson id. I'd like to do this without creating a whole new entity that might be titled UserLessonStatus and have it reference the User and have to be queried, since I often want to iterate through all the lesson statuses. Maybe it is better done this way, in which case, I'd appreciate opinions that this is how it's best done. Otherwise if someone knows a good way to create a mapping within my User entity itself, that'd be great.
One solution I considered is using two ListProperties in conjunction, i.e. when adding an object append the key and value to each list; when locating, find the index of the string in one list and reference using that index in the other; when removing, find the index in one, use it to remove from each, and so forth.
You're probably better off using another kind, as you suggest. If you do want to store it all in the one entity, though, you have several options - parallel lists, as you suggest, are one option. You could also simply pickle a Python dictionary, assuming you don't want to query on it.
You may want to check out the ndb project, which supports nested entities; that would also be a viable solution.
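If you go the single-entity route, a minimal sketch using ndb's PickleProperty might look like the following (the property and field names are assumptions, this needs the App Engine runtime, and JsonProperty is an alternative if the statuses are JSON-serializable):

    from google.appengine.ext import ndb

    class User(ndb.Model):
        name = ndb.StringProperty()
        # Maps lesson IDs to status values; stored as a pickled blob (not queryable)
        lesson_status = ndb.PickleProperty()

    # Usage (illustrative data)
    user = User(name="example", lesson_status={"lesson-1": "completed"})
    user.put()

    fetched = user.key.get()
    print(fetched.lesson_status["lesson-1"])  # -> "completed"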
What is the most efficient method to store a Python dictionary on the disk? The only methods I know of right now are plain-text and the pickle module.
Edit: Sorry for not being very clear. By efficient I meant fastest execution speed. The dictionary will contain mutable objects that will hold information to be parsed and modified.
shelve is pretty nice as well.
Or there's this persistent dictionary recipe.
For a convenient method that keeps your objects synchronized with storage, there's the SQLAlchemy ORM for Python.
If you just need a way to store string values by some key, there are the dbm.ndbm and dbm.gnu modules.
If you need a hyper-efficient, distributed key-value cache, there's something like memcached for Python...
JSON and YAML work well, also.
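For instance, a quick sketch with shelve, which gives a dict-like interface backed by a file on disk (the file name and keys are made up):

    import shelve

    # Values are pickled transparently; keys must be strings
    with shelve.open("cache.db") as db:
        db["user:1"] = {"name": "Alice", "visits": 3}

    with shelve.open("cache.db") as db:
        print(db["user:1"]["visits"])   # -> 3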
It depends on what you mean by "efficient": size of file? Time required? Amount of code you need to write?
You have the timeit module available to determine what meets your specific criteria for "efficient".
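As a rough illustration (the sample data is arbitrary, and JSON only applies if your values are JSON-serializable), you could time pickle against json like this:

    import json
    import pickle
    import timeit

    # Arbitrary test data; substitute something shaped like your real dictionary
    data = {str(i): {"count": i, "tags": ["a", "b"]} for i in range(10000)}

    print("pickle:", timeit.timeit(lambda: pickle.dumps(data), number=100))
    print("json:  ", timeit.timeit(lambda: json.dumps(data), number=100))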