I am asking this question because I am working with huge data sets.
In my algorithm, I basically need something like this:
users_per_document = {}
documents_per_user = {}
As the names of the dictionaries suggest, I need the users that clicked a specific document and the documents that were clicked by a specific user.
In that case I have "duplicated" data, and keeping both structures together overflows the memory, so my script gets killed after a while. Because I work with very large data sets, I have to do this efficiently.
I suspect it is not possible, but I have to ask: is there a way to get all keys for a specific value from a dictionary?
Because if there is a way to do that, I will not need one of the dictionaries anymore.
For example:
users_per_document["document1"] obviously returns the appropriate
users,
what I need is users_per_document.getKeys("user1") because this will basically return the same thing with documents_per_user["user1"]
If it is not possible, any suggestion is appreciated.
If you are using Python 3.x, you can do the following. If 2.x, just use .iteritems() instead.
user1_documents = [key for key, value in users_per_document.items() if "user1" in value]  # documents whose user collection contains "user1"
Note: This does iterate over the whole dictionary. A dictionary isn't really an ideal data structure to get all keys for a specific value, as it will be O(n^2) if you have to perform this operation n times.
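If the reverse lookup is needed many times and you can afford the memory for a second mapping (which, per the question, may not be the case here), a cheaper approach is to build the inverse index once in a single pass instead of rescanning the dict for every query. A minimal sketch, assuming the values are collections of user names:

from collections import defaultdict

# One O(n) pass over the forward mapping builds the reverse mapping.
documents_per_user = defaultdict(set)
for document, users in users_per_document.items():
    for user in users:
        documents_per_user[user].add(document)

# documents_per_user["user1"] now holds every document "user1" clicked.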
I am not very sure about the Python side, but in general computer-science terms you can solve the problem in the following way:
Basically, you can have a two-dimensional array (a matrix): the first index is for users, the second index is for documents, and each entry is a boolean value.
The boolean value represents whether there is a relation between the specific user and the specific document.
PS: if the matrix is really sparse, you can store it much more efficiently, but that is another story.
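For illustration, here is a minimal Python sketch of the sparse idea, storing only the (user, document) pairs that are actually related instead of a dense boolean matrix (all names are made up for the example):

# Sparse user/document relation: keep only the pairs that are True.
clicks = set()

def add_click(user, document):
    clicks.add((user, document))

def has_clicked(user, document):
    return (user, document) in clicks

add_click("user1", "document1")
print(has_clicked("user1", "document1"))  # True
print(has_clicked("user1", "document2"))  # False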
Good morning!
I have retrieved a bunch of data from the Facebook Graph API using the Python facebook library. I'd like to organize all the meaningful data neatly in a CSV file for later analysis, but the problem is I'm quite new to Python and I don't know how to approach the format in which the data has been retrieved. Basically, I have all the data about a page's posts from 01-05-2020 in a list called data_basic:
Every element of the list represents one post and is a dict of size 8.
Every dict has: 3 dict elements, 3 string elements, 1 bool element and 1 list element.
For example, in order to access the media_type of the first post I must type: data_basic[0]['attachments']['data'][0]['value'], because inside the dict representing the first post I have a dict containing the attachments whose key 'data' contains a list in which I have the values (for example, again, media_type). A nightmare...
Every instance of the dict containing the post data is different... Attachment is the most nested, but something similar happens for the comments or the tags, while message, created time and so on are much more accessible.
I'd like to obtain a csv table whose rows are the various posts and whose columns are the variables (except, of course, the comments, which I'll store in a different file since there's more than one for each post).
How can I approach the problem? The first thing that comes to my mind is a brute-force approach using a for loop through all the posts and all the variables, filling the dataframe place by place. But I hope there's a quicker and more elegant way... I've come across the json_normalize function and tried something, but I really don't understand how it works or whether it can be of any help... Any thoughts?
Thanks in advance!
edit: a couple of screenshots to help understand the structure better
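For what it's worth, here is a rough sketch of how pandas.json_normalize might be applied to a list like data_basic; the field names ('attachments', 'data', 'media_type') are taken from the question and will likely need adjusting to the actual payload:

import pandas as pd

# Flatten the top-level fields of every post; nested dicts become
# dot-separated columns such as "attachments.data".
posts_df = pd.json_normalize(data_basic, sep='.', max_level=1)

# Pull one deeply nested field out by hand, guarding against posts
# that have no attachment at all.
posts_df['media_type'] = [
    (post.get('attachments', {}).get('data') or [{}])[0].get('media_type')
    for post in data_basic
]

posts_df.to_csv('posts.csv', index=False)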
I've been noodling around with Python for quite a while in my spare time, and while I have sort of understood and definitely used dictionaries, they've always seemed somewhat foreign to me, like I wasn't quite getting them. Maybe it's the name "dictionary" throwing me off, or the fact I started way back when with Basic (I know), which had arrays, but they were quite different.
Can I simply think of a dictionary in Python as nothing more or less than a two-column table where we name the contents of the first column "keys" and the contents of the second column "values"? Is this conceptualization extremely accurate and useful, or problematic?
If the former, I think I can finally swallow the concept in such a way to finally make it more natural to my thinking.
The analogy of a 2-column table might work to some degree but it doesn't cover some important aspects of how and why dictionaries are used in practice.
The comment by @Sayse is more conceptually useful. Think of the dictionary as a physical language dictionary, where the key is the word itself and the value is the word's definition. Two items in the dictionary cannot have the same key, but they could have the same value. In the analogy of a language dictionary, if two words had the same spelling then they are the same word; however, synonyms can exist, where two words that are spelled differently have the same definition.
The table analogy also doesn't capture how a dictionary is accessed: the order of items is not what matters, and an item is retrieved by its key rather than by its position (before Python 3.7, insertion order was not even guaranteed to be preserved). Perhaps another useful analogy is to think of the key as a person's name and the value as the person themselves (and maybe lots of information about them as well). The people are identified by their names, but they may be in any given order or location... it doesn't matter; since we know their names, we can identify them.
While the order of items in a dictionary may not be preserved, a dictionary has the advantage of having very fast retrieval for a single item. This becomes especially significant as the number of items to lookup grows larger (on the order of thousands or more).
Finally, I would also add that dictionaries can often improve the readability of code. For example, if you wanted to create a lookup table of HTML color codes, an API using a dictionary keyed by HTML color names is much more readable and usable than a list that relies on documentation of indices to retrieve the values.
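For instance, a small (assumed) color table shows the difference:

# Readable: the key says exactly what is being looked up.
html_colors = {
    "red": "#FF0000",
    "green": "#008000",
    "blue": "#0000FF",
}
print(html_colors["green"])       # "#008000"

# Less readable: a bare list forces you to remember that index 1 means "green".
html_colors_list = ["#FF0000", "#008000", "#0000FF"]
print(html_colors_list[1])        # "#008000"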
So if it helps you to conceptualize a dictionary as a table of 2 columns, that is fine, as long as you also keep in mind the rules for their use and the scenarios where they provide some benefit:
Duplicate keys are not allowed
Items are retrieved by key, not by position (and insertion order is only guaranteed to be preserved from Python 3.7 onward)
Retrieving a single item is fast (esp. for many items)
Improved readability of lookup tables
I have a Python application that performs correlation on large files. It stores the results in a dict. Depending on the input files, this dict can become really large, to the point where it no longer fits into memory. This causes the system to hang, so I want to prevent it.
My idea is that there are always correlations which are not so relevant for the later processing. These could be deleted without changing the overall result too much. I want to do this when I do not have much memory left.
Hence, I check the available memory periodically. If it drops too low (say, below 300 MB), I delete the irrelevant correlations to gain more space. That's the theory.
Now for my problem: In Python, you cannot delete from a dict while iterating over it. But this is exactly what I need to do, since I have to check each dict entry for relevancy before deleting.
The usual solution would be to create a copy of the dict for iteration, or to create a new dict containing only the elements that I want to preserve. However, the dict might be several GB in size and there are only a few hundred MB of free memory left, so I cannot afford much copying, since that may again cause the system to hang.
Here I am stuck. Can anyone think of a better method to achieve what I need? If in-place deletion of dict entries is absolutely not possible while iterating, maybe there is some workaround that could save me?
Thanks in advance!
EDIT -- some more information about the dict:
The keys are tuples specifying the values by which the data is correlated.
The values are dicts containing the correlated data. The keys of these inner dicts are always strings, and the values are numbers (int or float).
I am checking for relevancy by comparing the number values in the value-dicts with certain thresholds. If the values are below the thresholds, the particular correlation can be dropped.
I do not think that your solution to the problem is prudent.
If you have that much data, I recommend you find some bigger tools for your toolbox; one suggestion would be to let a local Redis server help you out.
Take a look at redis-collections, which will provide you with a dictionary-like object backed by Redis, giving you a sustainable solution.
>>> from redis_collections import Dict
>>> d = Dict()
>>> d['answer'] = 42
>>> d
<redis_collections.Dict at fe267c1dde5d4f648e7bac836a0168fe {'answer': 42}>
>>> d.items()
[('answer', 42)]
Best of luck!
Are the keys large? If not, you can loop over the dict to determine which entries should be deleted; store the key for each such entry in a list. Then loop over those keys and delete them from the dict.
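A sketch of that two-pass pattern, using the threshold check described in the question (the names correlations and THRESHOLD are placeholders):

# First pass: collect the keys of entries whose values are all below the threshold.
keys_to_delete = [
    key
    for key, inner in correlations.items()
    if all(value < THRESHOLD for value in inner.values())
]

# Second pass: delete them; this is safe because we are no longer iterating the dict.
for key in keys_to_delete:
    del correlations[key]

The list of keys is typically tiny compared to the dict of inner dicts, so the extra memory should stay well within the few hundred MB that are left.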
I am developing an AI to solve an MDP (Markov decision process). I get states (just integers in this case) and assign each one a value, and I am going to be doing this a lot. So I am looking for a data structure that can hold that information (no need for deletion) and has very fast get/update operations. Is there something faster than the regular dictionary? I am open to anything: native Python, open source, I just need fast lookups.
Using a Python dictionary is the way to go.
You're saying that all your keys are integers? In that case, it might be faster to use a list and just treat the list indices as the key values. However, you'd have to make sure that you never delete or add list items; just start with as many as you think you'll need, setting them all equal to None, as shown:
mylist = [None] * totalitems
Then, when you need to "add" an item, just set the corresponding value.
Note that this probably won't actually gain you much in terms of actual efficiency, and it might be more confusing than just using a dictionary.
For 10,000 items, it turns out (on my machine, with my particular test case) that accessing each one and assigning it to a variable takes about 334.8 seconds with a list and 565 seconds with a dictionary.
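A rough sketch of how such a comparison could be timed with the standard timeit module (the sizes and repeat counts here are arbitrary, and the absolute numbers will differ from the ones above):

import timeit

setup = """
n = 10000
as_list = [0] * n
as_dict = {i: 0 for i in range(n)}
"""

list_time = timeit.timeit("for i in range(n): x = as_list[i]", setup=setup, number=1000)
dict_time = timeit.timeit("for i in range(n): x = as_dict[i]", setup=setup, number=1000)
print("list:", list_time, "dict:", dict_time)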
If you want a rapid prototype, use python. And don't worry about speed.
If you want to write fast scientific code (and you can't build on fast native libraries, like LAPACK for linear algebra), write it in C or C++ (perhaps only as an extension to call from Python). If fast instead of ultra-fast is enough, you can also use Java or Scala.
I am trying to implement a data structure which allows rapid look-ups based on keys.
The Python dict is great when my look-ups involve an equality
(e.g. key == somevalue translates to datadict[somevalue]).
The problem is that I also need to be able to efficiently look up keys based on a more complex comparison, e.g. key > 50, or key.startswith('abc').
Obviously I can't use the same solution in both cases, but at the moment I can't figure out how to solve either case. Can anyone suggest a way of doing this?
It doesn't sound like you want a hash algorithm - instead some form of binary tree, or even a sorted list that you use with the bisect module. It'd be worth looking at: Python's standard library - is there a module for balanced binary tree?
Another option (depending on your data) would be to use an in-memory sqlite3 database and create appropriate indices for the possible lookups -- but you'll trade performance/memory and SQL syntax for flexibility...
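A minimal sketch of the in-memory sqlite3 idea (the table and column names are invented for the example):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (key TEXT, value TEXT)")
conn.execute("CREATE INDEX idx_key ON data (key)")
conn.executemany("INSERT INTO data VALUES (?, ?)",
                 [("abc1", "first"), ("abc2", "second"), ("xyz9", "third")])

# Both kinds of lookup are easy to express in SQL.
greater = conn.execute("SELECT * FROM data WHERE key > ?", ("abc1",)).fetchall()
prefixed = conn.execute("SELECT * FROM data WHERE key LIKE ?", ("abc%",)).fetchall()
print(greater)
print(prefixed)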
Put all data items in a list.
Sort the list on the key.
Use binary search to efficiently find items where key > 50 or where key.startswith('abc').
Of course, this only pays off if you have really very many data items. If you do not have that many, simply loop through the list and apply your condition to every key.
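For many items, a sketch of both kinds of look-up with the standard bisect module on sorted keys might look like this (the sample keys are made up):

import bisect

# Numeric keys: find all keys > 50.
numeric_keys = sorted([3, 17, 42, 51, 99])
start = bisect.bisect_right(numeric_keys, 50)
greater_than_50 = numeric_keys[start:]              # [51, 99]

# String keys: find all keys starting with 'abc'.
string_keys = sorted(["abba", "abc", "abcd", "abd", "zzz"])
lo = bisect.bisect_left(string_keys, "abc")
hi = bisect.bisect_left(string_keys, "abc\uffff")   # just past anything with the 'abc' prefix
starts_with_abc = string_keys[lo:hi]                # ['abc', 'abcd']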