One-to-many relationship in Google Datastore (Python)

I have two models like below:
class Food(db.Model):
    foodname = db.StringProperty()
    cook = db.StringProperty()

class FoodReview(db.Model):
    thereview = db.StringProperty()
    reviews = db.ReferenceProperty(Food, collection_name='thefoodreviews')
I go ahead and create an entity:
s = Food(foodname='apple', cook='Alice')
s.put()
When someone writes a review, a function that does the following comes into play:
theentitykey = db.Query(Food, keys_only=True).filter('foodname =', 'apple').get()
r = FoodReview()
r.reviews = theentitykey  # the key of the entity retrieved above, stored as a reference property
r.thereview = 'someones review'  # someone writes a review
r.put()
Now the problem is how to retrieve these reviews. If I know the key of the entity, I can just do this:
theentityobject = db.get(food_key)  # but then the issue is how to know the key
for elem in theentityobject.thefoodreviews:
    print elem.thereview
Otherwise I can do something like this:
theentityobj = db.Query(Food).filter('foodname =', 'apple').get()
and then iterate as above. But are these two approaches the correct ones?

If you're always doing db.Query(Food).filter('foodname =', 'apple') to get the food, then it looks like foodname is effectively your key.
Why not just use it as a key_name?
Then you can even fetch the reviews without fetching the food itself:
key = db.Key.from_path('Food', 'apple')
reviews = FoodReview.all().filter('reviews =', key)
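If it helps, here is a minimal sketch of that approach (assuming you can set the key_name when the Food is first created; the variable names are just for illustration):
s = Food(key_name='apple', foodname='apple', cook='Alice')  # key_name doubles as the foodname
s.put()

# Later, rebuild the key without any query:
key = db.Key.from_path('Food', 'apple')
for review in FoodReview.all().filter('reviews =', key):
    print review.thereview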

The second method looks exactly like what the App Engine tutorial advises.
It seems like the right thing to do if you want to find all reviews for a particular foodname.


Store and Scrape Over Time

I'm brand new here and brand new to Python and programming in general. I wrote a simple script today that I'm pretty proud of as a new beginner. I used BS4 and Requests to scrape some data from a website. I put all of the data in dictionaries inside a list. The same key/value pairs exist for every list item. For simplicity, I'm left with something like this:
[{'country': 'us', 'state': 'new york', 'people': 50}, {'country': 'us', 'state': 'california', 'people': 30}]
Like I said, pretty simple, but then I can turn it into a Pandas dataframe and everything is organized with a few hundred different dictionaries inside the list. My next step is to run this scrape every hour for 5 hours, where the only thing that changes is the value of the 'people' key. All of a sudden I'm not sure a list of lists of dictionaries (did I say that right?!) is a great idea. Plus, I really only need to get the updated values of 'people' from the webpage. Is this something I can realistically do with built-in Python lists and dictionaries? I don't know much about databases, but I'm thinking that maybe SQLite might be good to use. I really only know about it in concept but haven't worked with it. Thoughts?
Ideally, after several scrapes, I would have easy access to the data to say, see 'people' in 'new york' over time. Or find at what time 'california' had the highest number of people. And then I could plot the data in 1000 different ways! I'd love any guidance or direction here. Thanks a bunch!
You could create a Python class, like this:
class StateStats:
    def __init__(self, country, state, people):
        self.country = country
        self.state = state
        self.people = people

    def update(self):
        # Do whatever your update script does here,
        # except assign the new value when it changes:
        # self.people = new_people_value
        pass
And then create instances of it like this:
# For each state you have scraped, make a new instance of this class.
# This assumes that the list you gathered is stored in a variable named my_list.
state_stats_list = []
for dictionary in my_list:
    state_stats_list.append(
        StateStats(
            dictionary['country'],
            dictionary['state'],
            dictionary['people']
        )
    )
# Or, instead, you can create the class instances directly when you
# scrape the webpage, instead of creating a list of dicts and then
# another list of classes from it.
You could also use a database like SQLite, but I think this will be fine for your purpose. Hope this helps!
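If you do want to try SQLite, here is a minimal sketch using only the standard library (the file name, table name, and columns are made up for illustration):
import sqlite3
import time

conn = sqlite3.connect('scrapes.db')
conn.execute("""CREATE TABLE IF NOT EXISTS stats (
                    scraped_at INTEGER,
                    country TEXT,
                    state TEXT,
                    people INTEGER)""")

def save_scrape(rows):
    # rows is the list of dicts produced by one scrape
    now = int(time.time())
    conn.executemany(
        "INSERT INTO stats VALUES (?, ?, ?, ?)",
        [(now, r['country'], r['state'], r['people']) for r in rows])
    conn.commit()

# 'people' in 'new york' over time is then a single query:
# SELECT scraped_at, people FROM stats WHERE state = 'new york' ORDER BY scraped_at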

How can I query for multiple properties not known in advance using Expando?

I'm making an application in which a user can create categories to put items in. The items share some basic properties, but the rest are defined by the category they belong to. The problem is that both the category and its special properties are created by the user.
For instance, the user may create two categories: books and buttons. In the books category he may create two properties: number of pages and author. In the buttons category he may create different properties: number of holes and color.
Initially, I placed these properties in a JsonProperty inside the Item. While this works, it means that I query the Datastore just by specifying the category I am looking for, and then I have to filter the results in code. For example, if I'm looking for all the books whose author is Carl Sagan, I would query the Item class with category == books and then loop through the results to keep only those that match the author.
While I don't really expect to have that many items per category (probably in the hundreds, unlikely to reach one thousand), this looks inefficient. So I tried using ndb.Expando to make those special properties real, indexed properties, adding the corresponding special properties to the item when putting it to the Datastore. So if the user creates an Item in the books category, and previously created the special property 'author' in that category, the Item is saved with the special property expando_author = author on it. Up to this point it worked as I expected (on the dev server).
The real problem became visible when I ran some queries. While they worked on the dev server, they created composite indexes for each special/expando property, even though the query filters were equality-only. And while each category can have at most five properties, it is evident that this can easily get out of control.
Example query:
items = Item.query()
for p in properties:
    items = items.filter(ndb.GenericProperty(p) == properties[p])
items.fetch()
Now, since I don't know in advance what the properties will be (though I will limit them to 5), I can't build the indexes before uploading the application, and even if I could, it would probably mean having more indexes than I'm comfortable with. Is Expando the wrong tool for what I'm trying to do? Should I just keep filtering the results in code using the JsonProperty? I would greatly appreciate any advice I can get.
P.S. To make this post shorter I omitted a few details about what I did; if you need to know something I may have left out, just ask in the comments.
Consider storing a category's properties in a single list property, with each value prefixed by the property name.
Like this (forgive the syntax; I forget the exact Python spelling, having switched to Go):
class Item(ndb.Model):
    category = ndb.StringProperty()
    props = ndb.StringProperty(repeated=True)

book = Item(category='book', props=['author:Carl Sagan'])
button = Item(category='button', props=['holes:5'])
Then you can have a single composite index on category+props and do queries like this:
def filter_items(category, prop_name, prop_value):
    return Item.query(Item.category == category,
                      Item.props == prop_name + ':' + prop_value)
And you would need a function on Item to get property values cleaned up from prop names.
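A hypothetical version of that cleanup helper (the name get_prop_values is made up for the example):
def get_prop_values(item, prop_name):
    # Strip the 'name:' prefix from each matching entry in props.
    prefix = prop_name + ':'
    return [p[len(prefix):] for p in item.props if p.startswith(prefix)]

# e.g. get_prop_values(book, 'author') -> ['Carl Sagan']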

Python Praw skipping sticky in subreddits

I am trying to loop through subreddits but want to ignore the sticky posts at the top. I am able to print the first 5 posts, unfortunately including the stickies. Various pythonic methods of trying to skip these have failed. Two different examples of my code are below.
subreddit = reddit.subreddit(sub)
for submission in subreddit.hot(limit=5):
    # If we haven't replied to this post before
    if submission.id not in posts_replied_to:
        ## FOOD
        if subreddit == 'food':
            # First attempt:
            if 'pLEASE SEE' in submission.title:
                pass
            if "please vote" in submission.title:
                pass
            else:
                print(submission.title)
            # Second attempt:
            if re.search("please vote", submission.title, re.IGNORECASE):
                pass
            else:
                print(submission.title)
I noticed a sticky tag in the docs but am not sure exactly how to use it. Any help is appreciated.
Submissions which are stickied have a stickied attribute that evaluates to True. Add the following to your loop, and you should be good to go.
if submission.stickied:
    continue
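For context, this is the loop from the question with that check folded in (a sketch; posts_replied_to is the set from the original code):
subreddit = reddit.subreddit(sub)
for submission in subreddit.hot(limit=5):
    if submission.stickied:
        continue  # skip stickied posts entirely
    if submission.id not in posts_replied_to:
        print(submission.title)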
In general, I recommend checking the available attributes on the objects you are working with to see if there is something usable. See: Determine Available Attributes of an Object
It looks like you can get the id of a stickied post, based on the docs. So perhaps you could get the id(s) of the stickied post(s) (note that with the 'number' parameter of the sticky method you can ask for the first, second, or third stickied post; use this to your advantage to get all of the stickied posts) and, for each submission you are going to pull, first check its id against the stickied ids.
Example:
# assuming there are no more than three stickies...
stickies = [reddit.subreddit("chicago").sticky(i).id for i in range(1,4)]
and then when you want to make sure a given post isn't stickied, use:
if post.id not in stickies:
    pass  # do something with the post
It looks like, were there fewer than three, this would give you a list with duplicate ids, which won't be a problem.
As an addendum to @Al Avery's answer, you can do a complete search for the IDs of all stickies on a given subreddit by doing something like:
import itertools
import prawcore

def get_all_stickies(sub):
    stickies = set()
    for i in itertools.count(1):
        try:
            sid = sub.sticky(i).id
        except prawcore.NotFound:
            break
        if sid in stickies:
            break
        stickies.add(sid)
    return stickies
This function takes into account that the documentation leads one to expect an error if an invalid index is supplied to sticky, while the actual behavior seems to be that a duplicate ID is returned. Using a set instead of a list makes lookups faster if you have a large number of stickies. You would use the function as:
subreddit = reddit.subreddit(sub)
stickies = get_all_stickies(subreddit)
for submission in subreddit.hot(limit=5):
    if submission.id not in posts_replied_to and submission.id not in stickies:
        print(submission.title)

Parsing JSON in Python (Reverse dictionary search)

I'm using Python and "requests" to practice using an API. I've had success with basic requests and parsing, but I'm having difficulty with list comprehension for a more complex project.
I requested from a server and got a dictionary. From there, I used:
participant_search = match1_request['participantIdentities']
to extract the value of the participantIdentities key, which gives the following data:
[{'player':
     {'summonerName': 'Crescent Bladex',
      'matchHistoryUri': '/v1/stats/player_history/NA1/226413119',
      'summonerId': 63523774,
      'profileIcon': 870},
  'participantId': 1},
My goal here is to combine the summonerId and participantId into one list. That would normally be easy, but the order of participantIdentities is randomized, so the player I want information on will sometimes be first in the list and other times third.
So I can't use the var = list[0] approach I would normally use.
I have access to the summonerId, so I'm thinking I can search the list for it and then somehow collect all the information around it. For instance, if I knew 63523774, I could find the key for it. From there, is it possible to find the parent dictionary containing that key?
Any guidance would be appreciated.
Edit (Clarification):
Here's the data I'm working with: http://pastebin.com/spHk8VP0
Line 1691 is where the nested dictionary 'participantIdentities' is. From there, there are 10 dictionaries. Each of these includes a nested dictionary "player" and a key "participantId".
My goal is to search these 10 dictionaries for the one dictionary that has the summonerId. The summonerId is something I already know before I make this request to the server.
So I'm looking for some sort of "search" method that goes beyond true/false: a search method that, if a value is found within an object, gives back the entire dictionary (key:value) containing it.
Not sure if I properly understood you, but would this work?
for i in range(len(match1_request['participantIdentities'])):
    if match1_request['participantIdentities'][i]['player']['summonerId'] == 63523774:
        # do whatever you want with it
        pass
i becomes the index you're searching for.
ds = match1_request['participantIdentities']
result_ = [d for d in ds if d["player"]["summonerId"] == 12345]
result = result_[0] if result_ else {}
See if it works for you.
You can use a dict comprehension to build a dict which uses summonerIds as keys:
players_list = response['participantIdentities']
by_summoner = {p['player']['summonerId']: p['participantId'] for p in players_list}
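With that mapping built (by_summoner above), finding the participant for a summonerId you already know is a plain dict lookup:
participant_id = by_summoner[63523774]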
I think what you are asking for is: "How do I get the stats for a given a summoner?"
You'll need a mapping of participantId to summonerId.
For example, would it be helpful to know this?
summoner[1] = 63523774
summoner[2] = 44610089
...
If so, then:
# This is probably what you are asking for:
summoner = {ident['participantId']: ident['player']['summonerId']
            for ident in match1_request['participantIdentities']}

# Then you can do this:
summoner_stats = {summoner[p['participantId']]: p['stats']
                  for p in match1_request['participants']}

# And to look up a particular summoner's stats:
print summoner_stats[44610089]
(ref: raw data you pasted)

Add a field to existing document in CouchDB

I have a database with a bunch of regular documents that look something like this (example from wiki):
{
"_id":"some_doc_id",
"_rev":"D1C946B7",
"Subject":"I like Plankton",
"Author":"Rusty",
"PostedDate":"2006-08-15T17:30:12-04:00",
"Tags":["plankton", "baseball", "decisions"],
"Body":"I decided today that I don't like baseball. I like plankton."
}
I'm working in Python with couchdb-python and I want to know if it's possible to add a field to each document. For example, if I wanted to have a "Location" field or something like that.
Thanks!
Regarding IDs
Every document in CouchDB has an id, whether you set it or not. Once the document is stored, you can access it through the doc._id field.
If you want to set your own ids you'll have to assign the id value to doc._id. If you don't set it, CouchDB will assign a uuid.
If you want to update a document, you need to make sure you have the same id and a valid revision. If, say, you are working from a blog post and the user adds the Location, then the URL of the post may be a good id to use; you'd be able to access the document instantly in that case.
So what's a revision
In your code snippet above you have the doc._rev element. This is the identifier of the revision. If you save a document with an id that already exists, CouchDB requires you to prove that you have the current revision and are not trying to overwrite someone else's changes.
So how do I update a document
If you have the id of your document, you can just access each document by using the db.get(id) function. You can then update the document like this:
doc = db.get(id)
doc['Location'] = "On a couch"
db.save(doc)
I have an example where I store weather forecast data and update the forecasts approximately every 2 hours. A separate process adds data that I get from a different provider, based on characteristics of tweets on the day.
This looks something like this.
doc = db.get(id)
doc_with_loc = GetLocationInformationFromOtherProvider(doc) # takes about 40 seconds.
doc_with_loc["_rev"] = doc["_rev"]
db.save(doc_with_loc) # This will fail if weather update has also updated the file.
If you have concurrent processes, the _rev can become invalid, so you need a fail-safe. In couchdb-python a conflicting save raises a ResourceConflict exception rather than returning False, so the retry loop could look like this:
doc = db.get(id)
doc_with_loc = GetLocationInformationFromAltProvider(doc)
update_outstanding = True
while update_outstanding:
    doc = db.get(id)  # re-retrieve to get the latest _rev
    doc_with_loc["_rev"] = doc["_rev"]
    try:
        db.save(doc_with_loc)
        update_outstanding = False
    except couchdb.http.ResourceConflict:
        pass  # someone else updated the doc in the meantime; try again
So how do I get the Ids?
One option, suggested above, is that you actively set the id so you can retrieve it later, i.e. if a user sets a given location that is attached to a URL, use the URL. But you may not know which document you want to update, or you may even want a process that finds all the documents without a location and assigns one.
You'll most likely be using a view for this. Views have a map function and a reduce function; you'll use the first one and can forget about the second. A view with a map function does the following:
It returns a simplified/transformed way of looking at your data. You can return multiple values per document or skip some. The data you emit gets a key, and if you use the include_docs option the view will also give you the document (with _id and _rev alongside).
The simplest view is the default view db.view('_all_docs'); it returns all documents, and you may not want to update all of them. Note that views are themselves stored as documents when you define them.
The next simplest way is to have a view that only returns items of a certain document type. I tend to have a _type="article" field in my database. Think of this as marking that a document belongs to a certain table, as if you had stored them in a relational database.
Finally, you can filter on elements that don't have a location yet, so you'd have a view you can iterate over for all the docs that still need a location, and fix them in a separate process. The best documentation on writing views can be found here.
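As an illustration, here is a minimal sketch with couchdb-python; the server URL, database name, design-document name, and the _type convention are all assumptions for the example:
import couchdb

couch = couchdb.Server('http://localhost:5984/')
db = couch['posts']

# A view whose map function emits only documents still missing a Location:
design = {
    '_id': '_design/maintenance',
    'views': {
        'missing_location': {
            'map': """function(doc) {
                if (doc._type == 'article' && !doc.Location) {
                    emit(doc._id, null);
                }
            }"""
        }
    }
}
db.save(design)

# A separate process can then iterate over exactly those documents and fix them:
for row in db.view('maintenance/missing_location', include_docs=True):
    doc = row.doc
    doc['Location'] = 'Unknown'
    db.save(doc)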
