Filtering data in python

I am working on a web crawler for python that gathers information on posts by users on a site and compares their scores for posts all provided users participate in. It is currently structured so that I receive the following data:
results is a dictionary indexed by username that contains dictionaries of each user's history in a post, points key value structure.
common is a list that starts with all the posts in the dictionary of the first user in results. This list should be filtered down to only the posts all users have in common
points is a dictionary indexed by username that keeps a running total of points on shared posts.
My filtering code is below:
common = list(results.values()[0].keys())
for user in results:
    for post_hash in common:
        if post_hash not in results[user]:
            common.remove(post_hash)
        else:
            points[user] += results[user][post_hash]
The issue I'm encountering is that this doesn't actually filter out posts that aren't shared, and thus, doesn't provide accurate point values.
What am I doing wrong with my structure, and is there any easier way to find only the common posts?

I think you may have two issues:
Using a list for common means that when you remove an item via common.remove while you are iterating over that same list, the iterator skips over elements, so some non-shared posts survive the filter.
You're not just adding points for posts shared by all users - you're adding points as you encounter each user, before you know whether that post is shared by everyone or not.
Without some actual data to play with, it's a little difficult to write working code, but try this:
# this should give us a list of posts shared by all users
common = set.intersection(*[set(k.keys()) for k in results.values()])
# there's probably a more efficient (functional) way of summing the points
# by user instead of looping, but simple is good.
for user in results:
    for post_hash in common:
        points[user] += results[user][post_hash]
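For illustration, here is the intersection approach run on a small made-up results dict (the usernames, post hashes and scores are invented):

```python
from collections import defaultdict

# Invented sample data in the shape the question describes:
# {username: {post_hash: points, ...}, ...}
results = {
    'alice': {'p1': 3, 'p2': 5, 'p3': 1},
    'bob':   {'p1': 2, 'p3': 4},
    'carol': {'p1': 7, 'p3': 2, 'p4': 9},
}

# Posts present in every user's history
common = set.intersection(*[set(k.keys()) for k in results.values()])

# Running point totals, restricted to the shared posts
points = defaultdict(int)
for user in results:
    for post_hash in common:
        points[user] += results[user][post_hash]

print(sorted(common))   # ['p1', 'p3']
print(dict(points))     # {'alice': 4, 'bob': 6, 'carol': 9}
```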

from collections import Counter

posts = []
# Create a list of all the post hashes
for p in results.values():
    posts.extend(p.keys())
# Use Counter to create a dictionary-like object where the key
# is the post hash and the value is the number of occurrences
posts = Counter(posts)
for user in results:
    # Sum only the posts that appear in every user's history.
    points[user] = sum(score for post, score in results[user].items()
                       if posts[post] == len(results))

import functools
iterable = (v.keys() for v in results.values())
common = functools.reduce(lambda x, y: x & y, iterable)
points = {user: sum(posts[post] for post in common) for user,posts in results.items()}
See if this works.

Related

manipulating the contents of dictionaries inside dictionaries

I am receiving data from a radar on different contacts. Each contact has a lat, lon, direction, range and time stamp, and each time hit on a contact will be ID'd, such as 1, 2, 3, etc. For one contact this suggests a dictionary over time. Therefore, my dictionary for one contact will look something like this:
{1:[data # t1], 2:[data # t2], 3:[data # t3]}
And as time goes on the dictionary will fill up until ... But there will not be only one contact. There will be several, maybe many. This suggests a dictionary of dictionaries:
{'SSHornblower': {1:[data], 2:[data], 3:[data]},
'Lustania': {1:[], 2:[], 3:[]},
'Queen Mary': {1:[], 2:[], 3:[], 4:[]}}
It is not possible to know beforehand how many contacts my radar will find, maybe 3, maybe 300. I cannot come up with names ahead of time for all the possible contacts and names for all the possible dictionaries. Therefore, I came up with the idea that once I nested a dictionary inside the larger dictionary, I could clear it and start over with the new contact. But when I do a clear after I nest one inside another, it clears everything inside the larger dictionary! Is there a way to get around this?
For filling up nested dictionaries a defaultdict can be very useful.
Let's assume you have a function radar() that returns three values:
contact_name
contact_id
contact_data
Then the following would do the job:
from collections import defaultdict

store = defaultdict(dict)
while True:
    contact_name, contact_id, contact_data = radar()
    store[contact_name][contact_id] = contact_data
So even if there is a new contact_name that is not yet present in the store, the magic of defaultdict will make sure that an empty nested dict is already there when you access the store with the new key. Therefore store[new_contact_name][new_contact_id] = new_contact_data will work.
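A minimal self-contained sketch of the same idea, with radar() replaced by a hardcoded list of readings (the contact names and data values are made up):

```python
from collections import defaultdict

# Fake radar feed: (contact_name, contact_id, contact_data) tuples
readings = [
    ('SSHornblower', 1, [10.0]),
    ('Lustania',     1, [20.0]),
    ('SSHornblower', 2, [11.0]),
]

store = defaultdict(dict)
for contact_name, contact_id, contact_data in readings:
    # New contact names get an empty nested dict automatically
    store[contact_name][contact_id] = contact_data

print(store['SSHornblower'])   # {1: [10.0], 2: [11.0]}
print(store['Lustania'])       # {1: [20.0]}
```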

Parsing JSON in Python (Reverse dictionary search)

I'm using Python and "requests" to practice the use of API. I've had success with basic requests and parsing, but having difficulty with list comprehension for a more complex project.
I requested from a server and got a dictionary. From there, I used:
participant_search = (match1_request['participantIdentities'])
To convert the values of the participantIdentities key to get the following data:
[{'player':
{'summonerName': 'Crescent Bladex',
'matchHistoryUri': '/v1/stats/player_history/NA1/226413119',
'summonerId': 63523774,
'profileIcon': 870},
'participantId': 1},
My goal here is to combine the summonerId and participantId into one list. Which is easy normally, but the order of participantIdentities is randomized, so the player I want information on will sometimes be first in the list, and other times third.
So I can't use var = list[0] like I normally would.
I have access to summonerId, so I'm thinking I can search the list for the summonerId, then somehow collect all the information around it. For instance, if I knew 63523774, then I could find the key for it. From here, is it possible to find the parent dictionary of the key?
Any guidance would be appreciated.
Edit (Clarification):
Here's the data I'm working with: http://pastebin.com/spHk8VP0
At line 1691 is where the nested dictionary 'participantIdentities' is. From here, there are 10 dictionaries. These 10 dictionaries each contain "player" (a nested dictionary) and "participantId".
My goal is to search these 10 dictionaries for the one dictionary that has the summonerId. The summonerId is something I already know before I make this request to the server.
So I'm looking for some sort of "search" method, that goes beyond "true/false". A search method that, if a value is found within an object, the entire dictionary (key:value) is given.
Not sure if I properly understood you, but would this work?
for i in range(len(match1_request['participantIdentities'])):
    if match1_request['participantIdentities'][i]['player']['summonerId'] == 63523774:
        # do whatever you want with it
        pass
i becomes the index you're searching for.
ds = match1_request['participantIdentities']
result_ = [d for d in ds if d["player"]["summonerId"] == 12345]
result = result_[0] if result_ else {}
See if it works for you.
You can use a dict comprehension to build a dict which uses summonerIds as keys:
players_list = response['participantIdentities']
{p['player']['summonerId']: p['participantId'] for p in players_list}
I think what you are asking for is: "How do I get the stats for a given a summoner?"
You'll need a mapping of participantId to summonerId.
For example, would it be helpful to know this?
summoner[1] = 63523774
summoner[2] = 44610089
...
If so, then:
# This is probably what you are asking for:
summoner = {ident['participantId']: ident['player']['summonerId']
            for ident in match1_request['participantIdentities']}

# Then you can do this:
summoner_stats = {summoner[p['participantId']]: p['stats']
                  for p in match1_request['participants']}

# And to look up a particular summoner's stats:
print(summoner_stats[44610089])
(ref: raw data you pasted)
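With a toy match1_request shaped like the pastebin excerpt (all values here are made up, not taken from the real data), the two comprehensions behave like this:

```python
# Toy data in the shape of the Riot API response described above
match1_request = {
    'participantIdentities': [
        {'participantId': 1, 'player': {'summonerId': 63523774}},
        {'participantId': 2, 'player': {'summonerId': 44610089}},
    ],
    'participants': [
        {'participantId': 1, 'stats': {'kills': 4}},
        {'participantId': 2, 'stats': {'kills': 9}},
    ],
}

# Map participantId -> summonerId
summoner = {ident['participantId']: ident['player']['summonerId']
            for ident in match1_request['participantIdentities']}

# Re-key each participant's stats by summonerId
summoner_stats = {summoner[p['participantId']]: p['stats']
                  for p in match1_request['participants']}

print(summoner)                   # {1: 63523774, 2: 44610089}
print(summoner_stats[44610089])   # {'kills': 9}
```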

Performance: look in list or sql query

I developed a software with PyQt and sqlite to manage scientific articles. Each article is stored in the sqlite database, and comes from a particular journal.
Sometimes, I need to perform some verifications on the articles of a journal. So I build two lists, one containing the DOI of the articles (a DOI is just a unique id for an article), and one containing booleans, True if the articles are ok, False if the articles are not:
def listDoi(self, journal_abb):
    """Function to get the doi from the database.
    Also returns a list of booleans to check if the data are complete"""
    list_doi = []
    list_ok = []
    query = QtSql.QSqlQuery(self.bdd)
    query.prepare("SELECT * FROM papers WHERE journal=?")
    query.addBindValue(journal_abb)
    query.exec_()
    while query.next():
        record = query.record()
        list_doi.append(record.value('doi'))
        if record.value('graphical_abstract') != "Empty":
            list_ok.append(True)
        else:
            list_ok.append(False)
    return list_doi, list_ok
This function returns the two lists. The lists can contain ~2000 items each. After that, to check if an article is ok, I just check if it is in the two lists.
EDIT: I also need to check if an article is only in list_doi.
So I wonder, because performance matters here: what is faster/better/more economic:
build the two lists, and check if the article is present in the two lists
write the function in another way: checkArticle(doi_article), and the function would perform a SQL query for each article
What about the speed and the space in RAM? Will the results be different if there are few items, or a lot of them?
Use time.perf_counter() to determine how long this process takes currently.
time_start = time.perf_counter()
# your code here
print(time.perf_counter() - time_start)
Based on that, if it is running too slowly, you can try each of your options and time them as well to look for an improvement in performance. As for checking the RAM usage, a simple way is this:
import os
import psutil

process = psutil.Process(os.getpid())
print(process.memory_info()[0] / float(2 ** 20))  # memory usage in MB
For a more in-depth memory usage check, look here: https://stackoverflow.com/a/110826/3841261. Always have a way to objectively measure when looking to improve speed/RAM usage/etc.
I would execute one sql query that finds the articles that are OK at once (perhaps in a function called find_articles() or something)
Think of it this way, why do something twice (copy all those rows and work with them) when you could do it once?
You want to basically execute this:
SELECT * from papers where (PAPERID in OTHERTABLE and OTHER RESTRAINT = "WHATEVER")
That's obviously just pseudocode, but I think you can figure it out.

Django: Iterating over multiple instances of a single model in a view

I'm relatively new to Django and face a problem that I couldn't solve yet:
I have two models which look like:
class Item(models.Model):
    char1 = models.CharField(max_length=200)
    char2 = models.CharField(max_length=200)

class Entry(models.Model):
    item = models.ForeignKey(Item)
    choice = models.IntegerField()
I have stored many Items in my database, and basically I want one view that randomly iterates through all the stored Items and, for each Item, displays char1 and char2 with an IntegerField and a 'next' button that stores a new Entry (with the current Item and the typed integer) in my database and directs me to the next (random) Item.
During my research I found, for example, the form wizard and formsets, but this is not what I want: the wizard needs multiple form models that it can display in succession, but I want to display (randomly) each instance of only one model (Item) and store one Entry for each.
I hope someone can give me a hint where to look for, because nowhere I found a documentation/tutorial for this use case, and since I'm not very experienced with Django, I can't figure it out at the moment...
Best regards and thanks in advance!
Judging by your title, the problem is the iterating over multiple items from a single (possibly random/unsorted) model.
If I'm not mistaken, what you are looking for is Pagination. From that text, a small example is:
>>> from django.core.paginator import Paginator
>>> objects = ['john', 'paul', 'george', 'ringo']
>>> p = Paginator(objects, 2)
>>> p.count
4
>>> p.num_pages
2
>>> p.page_range
[1, 2]
Although a list is shown above, Paginator can also be used on Django QuerySets, and it's functionality incorporated into Django Templates.
Let me know if this isn't what you're after.
Cheers.
Paulo Bu's answer provides the way to get the random ordering in the first place from Django's API. The tricky part about what you're doing is that it's not really RESTful to save the particular random ordering of the items between page loads, because that is not stateless. By default, your randomly ordered queryset will fall out of existence as soon as you serve the request, and there is no guarantee that you will circulate through all the items instead of getting repeats and misses. So you'll want to save that ordering. There are a bunch of options for how you might approach this:
Serve the entire randomized list of item IDs with every request, and have the backend serve up the data for the current item by index
Serve all of the data -- full items and entries -- so you can render everything client-side
Store the randomized list as a session variable
Store a permanent random ordering of the items by adding a float between 0 and 1 to every Item, ordering on that float, and starting at a random index (if you don't care whether each user has the same overall permutation)
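The session-variable option (the third bullet) reduces to a simple shuffle-and-pop pattern, sketched here with plain data structures and illustrative names rather than real Django session code:

```python
import random

# Stand-ins for Item primary keys; in Django these would come from
# Item.objects.values_list('id', flat=True)
item_ids = list(range(1, 6))

# One-time random ordering; in a view you would store this list in
# request.session so it survives between page loads
random.shuffle(item_ids)

served = []
while item_ids:
    # One "page load" per pop: serve the next item in the saved order
    served.append(item_ids.pop())

print(sorted(served))   # every item served exactly once, no repeats
```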

'if' element is not in list on Google App Engine

I am building an application for Facebook using Google App Engine. I was trying to compare friends in my user's Facebook account to those already in my application, so I could add them to the database if they are friends on Facebook but not in my application, or skip them if they are already friends in both. I was trying something like this:
request = graph.request("/me/friends")
user = User.get_by_key_name(self.session.id)
list = []
for x in user.friends:
    list.append(x.user)
for friend in request["data"]:
    if User.get_by_key_name(friend["id"]):
        friendt = User.get_by_key_name(friend["id"])
        if friendt.key not in user.friends:
            newfriend = Friend(friend = user,
                               user = friendt,
                               id = friendt.id)
            newfriend.put()
graph.request returns an object with the user's friends. How do I compare content in the two lists of retrieved objects? It doesn't necessarily need to be Facebook related.
(I know this question may be quite silly, but it is really being a pain for me.)
If you upgrade to NDB, the "in" operator will actually work; NDB implements a proper __eq__ operator on Model instances. Note that the key is also compared, so entities that have the same property values but different keys are considered unequal. If you want to ignore the key, consider comparing e1._to_dict() == e2._to_dict().
You should write a custom function to compare your objects, treating it as a comparison of nested dictionaries. As you will be comparing only the attributes and not functions, you have to do a nested dict comparison.
Reason: the attributes will not be callable and, hopefully, will not start with _, so you just have to compare the remaining elements from obj.__dict__, and the approach should be bottom-up, i.e. finish off the nested-level objects first (e.g. the main object could host other objects, which will have their own __dict__).
Lastly, you can consider the accepted answer code here: How to compare two lists of dicts in Python?
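A minimal sketch of that bottom-up attribute comparison; obj_eq and the class P are illustrative names, not part of NDB or any library:

```python
def obj_eq(a, b):
    """Compare two objects by their public, non-callable attributes,
    recursing into nested objects (bottom-up)."""
    def public_attrs(obj):
        return {k: v for k, v in vars(obj).items()
                if not k.startswith('_') and not callable(v)}

    if hasattr(a, '__dict__') and hasattr(b, '__dict__'):
        da, db = public_attrs(a), public_attrs(b)
        return da.keys() == db.keys() and all(obj_eq(da[k], db[k]) for k in da)
    # Plain values (ints, strings, ...) compare directly
    return a == b

class P:
    def __init__(self, x, y):
        self.x, self.y = x, y

print(obj_eq(P(1, P(2, 3)), P(1, P(2, 3))))   # True
print(obj_eq(P(1, P(2, 3)), P(1, P(2, 4))))   # False
```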