Store and Scrape Over Time

Store and Scrape Over Time - python

I'm brand new here and brand new to Python and programming in general. I wrote a simple script today that I'm pretty proud of as a new beginner. I used BS4 and Requests to scrape some data from a website. I put all of the data in dictionaries inside a list. The same key/value pairs exist for every list item. For simplicity, I'm left with something like this:
[{'country': 'us', 'state':'new york', 'people':50},{'country':'us', 'state':'california','people':30']}
Like I said, pretty simple, but then I can turn it into a Pandas dataframe and everything is organized with a few hundred different dictionaries inside the list. My next step is to do run this scrape every hour for 5 hours--and the only thing that changes is the value of the 'people' key. All of the sudden I'm not sure a list of lists of dictionaries (did I say that right?!) is a great idea. Plus, I really only need to get the updated values of 'people' from the webpage. Is this something I can realistically do with built in Python lists and dictionaries? I don't know much about databases, but I'm thinking that maybe SQLite might be good to use. I really only know about it in concept but haven't worked with it. Thoughts?
Ideally, after several scrapes, I would have easy access to the data to say, see 'people' in 'new york' over time. Or find at what time 'california' had the highest number of people. And then I could plot the data in 1000 different ways! I'd love any guidance or direction here. Thanks a bunch!

You could create a Python class, like this:
class StateStats:
def __init__(self, country, state, people):
self.country = country
self.state = state
self.people = people
def update():
# Do whatever your update script is here
# Except, update the value self.people when it changes
# Like this: self.people = newPeopleValueAsAVariable
And then create instances of it like this:
# For each country you have scraped, make a new instance of this class
# This assumes that the list you gathered is stored in a variable named my_list
state_stats_list = []
for dictionary in my_list:
state_stats_list.append(
StateStats(
dictionary['country'],
dictionary['state'],
dictionary['people']
)
)
# Or, instead, you can just create the class instances
# when you scrape the webpage, instead of creating a
# list and then creating another list of classes from that list
You could also use a database like SQLite, but I think this will be fine for your purpose. Hope this helps!

Related

Python- Insert new values into 'nested' list?

What I'm trying to do isn't a huge problem in php, but I can't find much assistance for Python.
In simple terms, from a list which produces output as follows:
{"marketId":"1.130856098","totalAvailable":null,"isMarketDataDelayed":null,"lastMatchTime":null,"betDelay":0,"version":2576584033,"complete":true,"runnersVoidable":false,"totalMatched":null,"status":"OPEN","bspReconciled":false,"crossMatching":false,"inplay":false,"numberOfWinners":1,"numberOfRunners":10,"numberOfActiveRunners":8,"runners":[{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":2.8,"size":34.16},{"price":2.76,"size":200},{"price":2.5,"size":237.85}],"availableToLay":[{"price":2.94,"size":6.03},{"price":2.96,"size":10.82},{"price":3,"size":33.45}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832765}...
All I want to do is add in an extra field, containing the 'runner name' in the data set below, into each of the 'runners' sub lists from the initial data set, based on selection_id=selectionId.
So initially I iterate through the full dataset, and then create a separate list to get the runner name from the runner id (I should point out that runnerId===selectionId===selection_id, no idea why there are multiple names are used), this works fine and the code is shown below:
for market_book in market_books:
market_catalogues = trading.betting.list_market_catalogue(
market_projection=["RUNNER_DESCRIPTION", "RUNNER_METADATA", "COMPETITION", "EVENT", "EVENT_TYPE", "MARKET_DESCRIPTION", "MARKET_START_TIME"],
filter=betfairlightweight.filters.market_filter(
market_ids=[market_book.market_id],
),
max_results=100)
data = []
for market_catalogue in market_catalogues:
for runner in market_catalogue.runners:
data.append(
(runner.selection_id, runner.runner_name)
)
So as you can see I have the data in data[], but what I need to do is add it to the initial data set, based on the selection_id.
I'm more comfortable with Php or Javascript, so apologies if this seems a bit simplistic, but the code snippets I've found on-line only seem to assist with very simple Python lists and nothing 'nested' (to me the structure seems similar to a nested array).
As per the request below, here is the full list:
{"marketId":"1.130856098","totalAvailable":null,"isMarketDataDelayed":null,"lastMatchTime":null,"betDelay":0,"version":2576584033,"complete":true,"runnersVoidable":false,"totalMatched":null,"status":"OPEN","bspReconciled":false,"crossMatching":false,"inplay":false,"numberOfWinners":1,"numberOfRunners":10,"numberOfActiveRunners":8,"runners":[{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":2.8,"size":34.16},{"price":2.76,"size":200},{"price":2.5,"size":237.85}],"availableToLay":[{"price":2.94,"size":6.03},{"price":2.96,"size":10.82},{"price":3,"size":33.45}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832765},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":20,"size":3},{"price":19.5,"size":26.36},{"price":19,"size":2}],"availableToLay":[{"price":21,"size":13},{"price":22,"size":2},{"price":23,"size":2}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832767},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":11,"size":9.75},{"price":10.5,"size":3},{"price":10,"size":28.18}],"availableToLay":[{"price":11.5,"size":12},{"price":13.5,"size":2},{"price":14,"size":7.75}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832766},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":48,"size":2},{"price":46,"size":5},{"price":42,"size":5}],"availableToLay":[{"price":60,"size":7},{"price":70,"size":5},{"price":75,"size":10}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832769},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":18.5,"size":28.94},{"price":18,"size":5},{"price":17.5,"size":3}],"availableToLay":[{"price":21,"size":20},{"price":23,"size":2},{"price":24,"size":2}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832768},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":4.3,"size":9},{"price":4.2,"size":257.98},{"price":4.1,"size":51.1}],"availableToLay":[{"price":4.4,"size":20.97},{"price":4.5,"size":30},{"price":4.6,"size":16}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832771},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":24,"size":6.75},{"price":23,"size":2},{"price":22,"size":2}],"availableToLay":[{"price":26,"size":2},{"price":27,"size":2},{"price":28,"size":2}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":12832770},{"status":"ACTIVE","ex":{"tradedVolume":[],"availableToBack":[{"price":5.7,"size":149.33},{"price":5.5,"size":29.41},{"price":5.4,"size":5}],"availableToLay":[{"price":6,"size":85},{"price":6.6,"size":5},{"price":6.8,"size":5}]},"sp":{"nearPrice":null,"farPrice":null,"backStakeTaken":[],"layLiabilityTaken":[],"actualSP":null},"adjustmentFactor":null,"removalDate":null,"lastPriceTraded":null,"handicap":0,"totalMatched":null,"selectionId":10064909}],"publishTime":1551612312125,"priceLadderDefinition":{"type":"CLASSIC"},"keyLineDescription":null,"marketDefinition":{"bspMarket":false,"turnInPlayEnabled":false,"persistenceEnabled":false,"marketBaseRate":5,"eventId":"28180290","eventTypeId":"2378961","numberOfWinners":1,"bettingType":"ODDS","marketType":"NONSPORT","marketTime":"2019-03-29T00:00:00.000Z","suspendTime":"2019-03-29T00:00:00.000Z","bspReconciled":false,"complete":true,"inPlay":false,"crossMatching":false,"runnersVoidable":false,"numberOfActiveRunners":8,"betDelay":0,"status":"OPEN","runners":[{"status":"ACTIVE","sortPriority":1,"id":10064909},{"status":"ACTIVE","sortPriority":2,"id":12832765},{"status":"ACTIVE","sortPriority":3,"id":12832766},{"status":"ACTIVE","sortPriority":4,"id":12832767},{"status":"ACTIVE","sortPriority":5,"id":12832768},{"status":"ACTIVE","sortPriority":6,"id":12832770},{"status":"ACTIVE","sortPriority":7,"id":12832769},{"status":"ACTIVE","sortPriority":8,"id":12832771},{"status":"LOSER","sortPriority":9,"id":10317013},{"status":"LOSER","sortPriority":10,"id":10317010}],"regulators":["MR_INT"],"countryCode":"GB","discountAllowed":true,"timezone":"Europe\/London","openDate":"2019-03-29T00:00:00.000Z","version":2576584033,"priceLadderDefinition":{"type":"CLASSIC"}}}

i think i understand what you are trying to do now
first hold your data as a python object (you gave us a json object)
import json
my_data = json.loads(my_json_string)
for item in my_data['runners']:
item['selectionId'] = [item['selectionId'], my_name_here]
the thing is that my_data['runners'][i]['selectionId'] is a string, unless you want to concat the name and the id together, you should turn it into a list or even a dictionary
each item is a dicitonary so you can always also a new keys to it
item['new_key'] = my_value

So, essentially this works...with one exception...I can see from the print(...) in the loop that the attribute is updated, however what I can't seem to do is then see this update outside the loop.
mkt_runners = []
for market_catalogue in market_catalogues:
for r in market_catalogue.runners:
mkt_runners.append((r.selection_id, r.runner_name))
for market_book in market_books:
for runner in market_book.runners:
for x in mkt_runners:
if runner.selection_id in x:
setattr(runner, 'x', x[1])
print(market_book.market_id, runner.x, runner.selection_id)
print(market_book.json())
So the print(market_book.market_id.... displays as expected, but when I print the whole list it shows the un-updated version. I can't seem to find an obvious solution, which is odd, as it seems like a really simple thing (I tried messing around with indents, in case that was the problem, but it doesn't seem to be, its like its not refreshing the market_book list post update of the runners sub list)!

How can I query for multiple properties not known in advance using Expando?

I'm making an application in which a user can create categories to put items in them. The items share some basic properties, but the rest of them are defined by the category they belong to. The problem is that both the category and it's special properties are created by the user.
For instance, the user may create two categories: books and buttons. In the 'book' category he may create two properties: number of pages and author. In the buttons category he may create different properties: number of holes and color.
Initially, I placed these properties in a JsonProperty inside the Item. While this works, it means that I query the Datastore just by specifying the category that I am looking for and then I have to filter the results of the query in the code. For example, if I'm looking for all the books whose author is Carl Sagan, I would query the Item class with category == books and the loop through the results to keep only those that match the author.
While I don't really expect to have that many items per category (probably in the hundreds, unlikely to get to one thousand), this looks inefficient. So I tried to use ndb.Expando to make those special properties real properties that are indexed. I did this, adding the corresponding special properties to the item when putting it to the Datastore. So if the user creates an Item in the 'books' category and previously created in that category the special property 'author', an Item is saved with the special property expando_author = author in it. It worked as I expected until this point (dev server).
The real problem though became visible when I did some queries. While they worked in the dev server, they created composite indexes for each special/expando property, even if the query filters were equality only. And while each category can have at most five properties, it is evident that it can easily get out of control.
Example query:
items = Item.query()
for p in properties:
items = items.filter(ndb.GenericProperty(p)==properties[p])
items.fetch()
Now, since I don't know in advance what the properties will be (though I will limit it to 5), I can't build the indexes before uploading the application, and even if I knew it would probably mean having more indexes that I'm comfortable with. Is Expando the wrong tool for what I'm trying to do? Should I just keep filtering the results in the code using the JsonProperty? I would greatly appreciate any advice I can get.
PD. To make this post shorter I omitted a few details about what I did, if you need to know something I may have left out just ask in the comments.

Consider storing category's properties in a single list property prefixed with category property name.
Like (forget me I forgot exact Python syntax, switched to Go)
class Item():
props = StringListProperty()
book = Item(category='book', props=['title:Carl Sagan'])
button = Item(category='button', props=['wholes:5'])
Then you can do have a single composite index on category+props and do queries like this:
def filter_items(category, propName, propValue):
Item.filter(Item.category == category).filter(Item.props==propName+':'+propValue)
And you would need a function on Item to get property values cleaned up from prop names.

manipulating the contents of dictionaries inside dictionaries

I am receiving data from a radar on different contacts. each contact has a lat, lon, direction, range and time stamp. and each time hit on a contact will be ID'd such as 1,2,3 etc. for one contact this suggests a dictionary over time. therefore, my dictionary for one contact will look something like this:
{1:[data # t1], 2:[data # t2], 3:[data # t3]}
And as time goes on the dictionary will fill up until ...But there will not be only one contact. there will be several, maybe many. this suggests a dictionary of dictionaries:
{'SSHornblower': {1:[data], 2:[data], 3:[data]},
'Lustania': {1:[], 2:[], 3:[]},
'Queen Mary': {1:[], 2:[], 3:[], 4:[]}}
It is not possible to know before hand how many contacts my radar will find, maybe 3 maybe 300. I cannot come up with names ahead of time for all the possible contacts and names for all the possible dictionaries. Therefore, I came up with the idea that once i nested a dictionary inside the larger dictionary, i could clear it and start over with the new contact. but when i do a clear after i nest one inside another, it clears everything inside the larger dictionary! Is there a way to get around this?

For filling up nested dictionaries a defaultdict can be very useful.
Let's assume you have a function radar() that returns three values:
contact_name
contact_id
contact_data
Then the following would do the job:
from collections import defaultdict
store = defaultdict(dict)
while True:
contact_name, contact_id, contact_data = radar()
store[contact_name][contact_id] = contact_data
So even if there will be a new contact_name that is not yet present in the store, the magic of defaultdict will make sure that an empty nested dict will already be there when you access the store with the new key . Therefore store[new_contact_name][new_contact_id] = new_contact_data will work.

Python/Django newbie attempting to combine two lists to a dictionary

I am completely new to python and django, so forgive me if this seems like a dumb question. I have a feeling I don't really understand what I am dealing with even though I know what I want my result to look like.
I need to have a Dictionary that has two sets of things, and for simplicity I am just going to say that my keys below correspond to an already working query - these are field names. All I want to happen is pre-populate a Django ModelForm with the data that ends up in the list called values
below. The data fetch for values works perfectly. The only reason I want a dictionary is because that is what the Django ModelForm wants passed to it in order to have my initial data for the user to change or accept. I believe I am getting tripped up because this is the first time I have ever dealt with dictionaries, especially in python under django. I've researched using the dict zip technique and it runs without error, but everything ends up being bound to the key I've called name which is of course coming from the field name of a database table.
If I put a import pdb ; pdb.set_trace right before my return statement and ask the debugger what data looks like here is what I get:
{'name': ['IB Report', 'IB Daily Report', 'Add description']}
Can someone set me on the right path?
def mysql_ser_reports_mgmt(id):
keys = ['name','subject','description']
values = []
for report in Report.objects.filter(id=id):
values.append([str(report.name), str(report.subject), str(report.description)])
data = dict(zip(keys,values))
return data

Django already has a filter method which does what you are looking for. Look into values()
So, you would do:
return Report.objects.filter(id=id).values('name', 'subject', 'description')
The output would be something like this:
[{'name': 'name', 'subject': '<Subject>', 'description': '<Description>'},
{'name': 'name2', 'subject': '<Subject2>', 'description': '<Description2>'},
]

one-many relationship-google datastore-python

I have two models like below:-
class Food(db.Model):
foodname=db.StringProperty()
cook=db.StringProperty()
class FoodReview(db.Model):
thereview=db.StringProperty()
reviews=db.ReferenceProperty(Food,collections_name='thefoodreviews')
I go ahead and create an entity:-
s=Food(foodname='apple',cook='Alice')`
s.put()
When someone writes a review, the function which does the below comes in play:
theentitykey=db.Query(Food,keys_only=True).filter('foodname =','apple').get()
r=FoodReview()
r.reviews=theentitykey #this is the key of the entity retrieved above and stored as a ref property here
r.thereview='someones review' #someone writes a review
r.put()
Now the problem is how to retrieve these reviews. If I know the key of the entity, I can just do the below:-
theentityobject=db.get(food-key) # but then the issue is how to know the key
for elem in theentityobject.thefoodreviews:
print elem.thereview
else I can do something like this:-
theentityobj=db.Query(Food).filter('foodname =','apple').get()
and then iterate as above, but are the above two ways the correct ones?

If to get the food you're always doing db.Query(Food).filter('foodname =','apple') then it looks like your foodname is your key...
Why not just use it as a key_name?
Then, you can even fetch the reviews without fetching the food itself:
key = db.Key.from_path('food', 'apple')
reviews = FoodReview.all().filter("reviews =", key)

The second method looks exactly like what AppEngine tutorial advices.
Seems like the right thing to do, if you want to find all reviews for a particular foodname.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.