Parsing data from API - Best way to get all data possible? - python

I am trying to get crash data from https://data.pa.gov/Public-Safety/Crash-Incident-Details-CY-1997-Current-Annual-Coun/dc5b-gebx using their API, with the documentation here: https://dev.socrata.com/docs/paging.html
When trying to use Python to do this, I am only able to get the default number of records, as below.
import requests
import pandas as pd

response = requests.get("https://data.pa.gov/resource/dc5b-gebx.json?limit=50000")
data = response.json()
df = pd.DataFrame(data)
When using limit, the API does not return a value.
I want to return as many values as possible (if not all of them) to do an analysis project with. A bit confused, would appreciate some help here - Thanks!

As stated in the API docs, you are forgetting the '$'; you should be requesting
https://soda.demo.socrata.com/resource/earthquakes.json?$limit=5000.
You can also request more than that, i.e.
https://soda.demo.socrata.com/resource/earthquakes.json?$limit=100000
But this only returns 10,820 results (not sure if this is the limit or the entire dataset).
(You can just use https://data.pa.gov/resource/dc5b-gebx.json?$limit=5 for your dataset, but this takes much longer to load so I am unsure of the limit)
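If you need more rows than a single request will give you, the Socrata paging docs linked above also describe $offset (plus an $order clause so the paging is stable). A minimal sketch of that loop, assuming a 50,000-row page size:

import requests
import pandas as pd

base_url = "https://data.pa.gov/resource/dc5b-gebx.json"
page_size = 50000   # rows per request; shrink this if the server rejects it
offset = 0
frames = []

while True:
    params = {"$limit": page_size, "$offset": offset, "$order": ":id"}
    batch = requests.get(base_url, params=params).json()
    if not batch:            # an empty page means we've reached the end
        break
    frames.append(pd.DataFrame(batch))
    offset += page_size

df = pd.concat(frames, ignore_index=True)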

Related

Is there a way to download the entire SBIR awards as a JSON file?

For my work I need to create a Python program to download all the results for "awards" from SBIR automatically.
There are, as of now, 171,616 results.
I have two possible options. I can download 1,000 at a time, but I need to verify that I am not a robot with the reCAPTCHA, so I cannot automate the download.
Or I could use their API, which would be great! But it only downloads 100 results when searching for everything available. Is there a way I could iterate through chunks and then compile them into one big JSON file?
This is the documentation.
This is where I say file>save as>filename.json
Any help/advice would really help me out.
Hmm, one way to go is to cycle through possible combinations of parameters that you know. E.g., the API accepts the parameters 'year' and 'company', among others. You can start with the earliest year that the award was given, say 1990, and cycle through the years up till the present:
https://www.sbir.gov/api/awards.json?year=2010
https://www.sbir.gov/api/awards.json?year=2011
https://www.sbir.gov/api/awards.json?year=2012
This way you'll get up to 100 awards per year. That's better; however, you mentioned that there are 171,616 possible results, meaning more than 100 per year, so it won't get all of them. You can use another parameter, 'company', in combination:
https://www.sbir.gov/api/awards.json?year=2010&company=luna
https://www.sbir.gov/api/awards.json?year=2011&company=luna
https://www.sbir.gov/api/awards.json?year=2010&company=other_company
https://www.sbir.gov/api/awards.json?year=2011&company=other_company
Now you are getting up to 100 results per company per year. That will give you way more results. You can get the list of companies from another endpoint they provide, which doesn't seem to have a limit on the results displayed - https://www.sbir.gov/api/firm.json . Watch out though: the JSON that comes out is absolutely massive and may freeze your laptop. You can use the values from that JSON for the 'company' parameter and cycle through those.
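A rough sketch of that cycling approach. The firm endpoint is assumed to return a JSON array, and 'company_name' is a guessed field name - check the actual payload first. Note this fires one request per year per company, which is a lot of requests:

import itertools
import json
import requests

# Guessed structure/field name; inspect https://www.sbir.gov/api/firm.json to confirm.
firms = requests.get("https://www.sbir.gov/api/firm.json").json()
companies = sorted({f.get("company_name") for f in firms
                    if isinstance(f, dict) and f.get("company_name")})

all_awards = []
for year, company in itertools.product(range(1990, 2025), companies):
    resp = requests.get("https://www.sbir.gov/api/awards.json",
                        params={"year": year, "company": company})
    if resp.ok:
        all_awards.extend(resp.json())   # assumes the endpoint returns a plain JSON list

with open("awards.json", "w") as fh:
    json.dump(all_awards, fh)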
Of course, all of that is a workaround and still doesn't guarantee you getting ALL of the results (although it might get them all). My first action would be to contact the website admins and tell them about your problem. A common thing to do for APIs that return a massive list of results is to provide a page parameter in the URL - https://www.sbir.gov/api/awards.json?page=2 - so that you can cycle through pages of results. Maybe you can persuade them to do that.
I wish they had better documentation. It seems we can do pagination via:
https://www.sbir.gov/api/awards.json?agency=DOE&start=100
https://www.sbir.gov/api/awards.json?agency=DOE&start=200
https://www.sbir.gov/api/awards.json?agency=DOE&start=300
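If that start parameter does behave like an offset, a hedged sketch of the paging loop (the DOE agency filter, the 100-per-page step and the empty-page stop condition are all assumptions based on the URLs above):

import json
import requests

all_awards = []
start = 0
while True:
    resp = requests.get("https://www.sbir.gov/api/awards.json",
                        params={"agency": "DOE", "start": start})
    batch = resp.json()
    if not batch:        # assume an empty page means we've run out of results
        break
    all_awards.extend(batch)
    start += 100         # the example URLs step by 100, so assume 100 results per page

with open("awards.json", "w") as fh:
    json.dump(all_awards, fh)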

Writing CSV from Elasticsearch results using Python when records exceed 10,000?

I'm able to create the CSV using the solution provided here:
Export Elasticsearch results into a CSV file
but a problem arises when the records exceed 10,000 (size=10000). Is there any way to write all the records?
The method given in your question uses Elasticsearch's Python API, and es.search does have a 10,000-document retrieval limit.
If you want to retrieve more than 10,000 documents, as suggested by dshockley in the comment, you can try the scroll API. Or you can try Elasticsearch's scan helper, which automates a lot of the work with the scroll API. For example, you won't need to get a scroll_id and pass it to the API, which would be necessary if you used scroll directly.
When using helpers.scan, you need to specify index and doc_type as parameters when calling the function, or write them in the query body. Note that the parameter name is 'query' rather than 'body'.
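A minimal sketch of that approach with helpers.scan, writing hits straight to CSV. The host, index name, doc_type and field list are placeholders for whatever your data actually uses, and doc_type only applies to older Elasticsearch versions:

import csv

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["http://localhost:9200"])

# Note: the keyword is 'query', not 'body'; index/doc_type are passed as parameters.
results = scan(es,
               query={"query": {"match_all": {}}},
               index="my_index",
               doc_type="my_doc_type")

fields = ["field1", "field2"]   # placeholder column names
with open("output.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for hit in results:          # scan streams every matching document, past the 10,000 cap
        writer.writerow(hit["_source"])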

YouTube Analytics API content-owner-based queries not returning data

I'm trying to get ad revenue data from the YouTube Analytics API. It seems that no queries I make with id=contentOwner==<CONTENT_OWNER_ID> return data: I get a 200 response back with all the column name information, but no rows and no actual data. This occurs even for metrics like comments, which does return data when I use id=channel==<CHANNEL_ID> (i.e., id=channel==<CHANNEL_ID>&metrics=comments&filters=video==<VIDEO_ID> returns the number of comments for that video; id=contentOwner==<CONTENT_OWNER_ID>&metrics=comments&filters=video==<VIDEO_ID> does not). The problem occurs both in my Python code and in the query explorer (https://developers.google.com/youtube/analytics/v1/reference/reports/query#try-it).
Am I doing something wrong? Is it a secret permissions issue, even though I'm getting a 200 back? Is it a bug?

Storing queryset after fetching it once

I am new to Django and web development.
I am building a website with a considerably sized database.
A large amount of data has to be shown on many pages, and a lot of this data is repeated. I mean I need to show the same data on many pages.
Is it a good idea to make a query to the database asking for the data on every GET request? It takes many seconds to get the data every time I refresh the page or request another page that shows the same data.
Is there a way to fetch the data once, store it somewhere, and just display it on every page, only refetching it when some updates are made?
I thought about the session, but I found that it is limited to 5MB, which is small for my data.
Any suggestions?
Thank you.
Django's cache - as mentioned by Leistungsabfall - can help, but like most cache systems it has some drawbacks too if you use it naively for this kind of problem (long queries/computations): when the cache expires, the next request will have to recompute the whole thing - which might take some time, during which every new request will trigger a recomputation... Also, proper cache invalidation can be really tricky.
Actually there's no one-size-fits-all answer to your question; the right solution is often a mix of different solutions (code optimisation, caching, denormalisation etc), based on your actual data, how often they change, how many visitors you have, how critical it is to have up-to-date data etc, but the very first steps would be to:
check the code fetching the data and find out if there are possible optimisations at this level using QuerySet features (.select_related() / .prefetch_related(), .values() and/or .values_list(), annotations etc) to avoid issues like the "n+1 queries" problem, fetching whole records and building whole model instances when you only need a single field's value, doing computations at the Python level when they could be done at the database level etc (see the sketch after this list)
check your db schema's indexes - well-used indexes can vastly improve performance, badly used ones can vastly degrade it...
and of course use the right tools (db query logging, Python's profiler etc) to make sure you identify the real issues.
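To make the first point concrete alongside Django's low-level cache, a minimal sketch; Article, its fields, the cache key and the timeout are all made-up examples:

from django.core.cache import cache

from myapp.models import Article  # hypothetical app/model, used only for illustration

def get_article_rows():
    def _fetch():
        # values_list joins on author in a single query and returns plain tuples,
        # so no full model instances are built and there is no n+1 on author access
        return list(Article.objects.values_list("title", "author__name"))
    # recompute at most once every 10 minutes; delete the key when the data changes
    # (cache.delete("article_rows")) to invalidate explicitly
    return cache.get_or_set("article_rows", _fetch, timeout=600)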

How to retrieve all post comments/likes via Facebook OpenGraph

I am trying to retrieve comments and likes for specific posts through Facebook's opengraph API. While I do get some information back, it does not always match the comments/likes count mentioned in the post. I guess this can be attributed to the access permissions of the token I'm using. However, I have noticed that results vary depending on the request limit I use, and sometimes I also get duplicate entries between requests.
For example, post 10376464573_150423345118848 has about 14000 likes as of this writing, but I can only retrieve a maximum of around 5000. With the default limit of 25 I can get up to 3021 likes. A value of 100 gives 4501, while limits of 1000, 2000, 3000 and 5000 all return the same number of likes, 4959 (the absolute values don't make too much sense of course, they are just there for comparison).
I have noticed similar results on a smaller scale for comments.
I'm using a simple python script to fetch pages. It goes through the data following the pagination links provided by Facebook, writing each page retrieved to a separate file. Once an empty reply is encountered it stops.
With small limits (e.g. the default of 25), I notice that the number of results returned is monotonically decreasing as I go through the pagination links, which seems really odd.
Any thoughts on what could be causing this behavior and how to work around it?
If you are looking for a list of the names of each and every like / comment on a particular post, I think you will run up against the API limit (even with pagination).
If you are merely looking for an aggregate number of likes, comments, shares, or link clicks, you'll want to simply use the summary=true param provided in the posts endpoint. Kind of like this:
import requests

def get_comment_summary(postid, apikey):
    try:
        endpoint = 'https://graph.facebook.com/v2.5/' + postid + '/comments?summary=true&access_token=' + apikey
        response = requests.get(endpoint)
        fb_data = response.json()
        return fb_data
    except requests.RequestException:
        return None
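The aggregate count should then sit under the summary key of the returned JSON, roughly:
total_comments = fb_data['summary']['total_count']  # aggregate comment count for the post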
You can also retrieve all of the posts of any particular page and their summary data points:
{page_id}/posts?fields=message,likes.limit(1).summary(true)
You can retrieve the comments and like count, or other information about a particular post, using the URL/API call below.
'https://graph.facebook.com/{0}/comments?access_token={1}&limit={2}&fields=from,message,message_tags,created_time,id,attachment,like_count,comment_count,parent&order=chronological&filter=stream'.format(post_id, access_token, limit)
Since the order is specified as chronological here, you also need to use the after parameter in the same URL; its value can be found in the paging.cursors.after section of the first response.
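A hedged sketch of that cursor loop, assuming each response carries paging.cursors.after and that an empty data page (or a missing cursor) marks the end:

import requests

def fetch_all_comments(post_id, access_token, limit=100):
    url = 'https://graph.facebook.com/{0}/comments'.format(post_id)
    params = {
        'access_token': access_token,
        'limit': limit,
        'fields': 'from,message,created_time,id,like_count,comment_count,parent',
        'order': 'chronological',
        'filter': 'stream',
    }
    comments = []
    while True:
        data = requests.get(url, params=params).json()
        comments.extend(data.get('data', []))
        after = data.get('paging', {}).get('cursors', {}).get('after')
        if not data.get('data') or not after:
            break
        params['after'] = after   # feed the cursor back in for the next page
    return comments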
