Tag great expectation results - python

When using an ExpectationStore such as S3, is there a field you can include to date/tag the GE JSON result so that it shows up in the GE UI as part of a specific dated batch? For example, you run daily or hourly jobs, and you would like to be able to rerun/backfill those jobs and see the GE results by day/hour over time, refreshed accordingly. I would like the UI to show the results by my date tag and not by when I store the result. I'm not sure how to accomplish this, as the documentation doesn't really give any details.

Related

Is there a way to download the entire SBIR awards as a JSON file?

For my work I need to create a Python program to download all the results for "awards" from SBIR automatically.
There are, as of now, 171,616 results.
I have two possible options. I can download 1,000 at a time, but I need to verify that I am not a robot with the reCAPTCHA, so I cannot automate the download.
Or I could use their API, which would be great! But it only returns 100 results when searching for everything available. Is there a way I could iterate through chunks and then compile them into one big JSON file?
This is the documentation.
This is where I do File > Save As > filename.json.
Any help/advice would really help me out.
Hmm, one way to go is to cycle through possible combinations of parameters that you know. E.g., the API accepts the parameters 'year' and 'company', among others. You can start with the earliest year an award was given, say 1990, and cycle through the years up to the present.
https://www.sbir.gov/api/awards.json?year=2010
https://www.sbir.gov/api/awards.json?year=2011
https://www.sbir.gov/api/awards.json?year=2012
This way you'll get up to 100 awards per year. That's better; however, you mentioned that there are 171,616 possible results, meaning more than 100 per year, so it won't get all of them. You can use another parameter, 'company', in combination.
https://www.sbir.gov/api/awards.json?year=2010&company=luna
https://www.sbir.gov/api/awards.json?year=2011&company=luna
https://www.sbir.gov/api/awards.json?year=2010&company=other_company
https://www.sbir.gov/api/awards.json?year=2011&company=other_company
Now you are getting up to 100 results per company per year. That will give you way more results. You can get the list of companies from another endpoint they provide, which doesn't seem to have a limit on the number of results displayed - https://www.sbir.gov/api/firm.json. Watch out though: the JSON that comes out is absolutely massive and may freeze your laptop. You can use the values from that JSON for the 'company' parameter and cycle through those.
Of course, all of that is a workaround and still doesn't guarantee getting ALL of the results (although it might get them all). My first action would be to try to contact the website admins and tell them about your problem. A common thing for APIs that return a massive list of results is to provide a page parameter in the URL - https://www.sbir.gov/api/awards.json?page=2 - so that you can cycle through pages of results. Maybe you can persuade them to do that.
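A minimal sketch of that year/company cycling with Python's requests library (this assumes the endpoint returns a plain JSON list of awards, as the URLs above suggest; no error handling):

import requests

BASE = "https://www.sbir.gov/api/awards.json"

def fetch_awards(years, companies):
    """Cycle through year/company combinations and collect the results."""
    awards = []
    for year in years:
        for company in companies:
            resp = requests.get(BASE, params={"year": year, "company": company})
            resp.raise_for_status()
            awards.extend(resp.json())  # assumed to be a JSON list of awards
    return awards

# e.g. fetch_awards(range(1990, 2019), ["luna", "other_company"])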
I wish they had better documentation. It seems we can do pagination via:
https://www.sbir.gov/api/awards.json?agency=DOE&start=100
https://www.sbir.gov/api/awards.json?agency=DOE&start=200
https://www.sbir.gov/api/awards.json?agency=DOE&start=300
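If the start parameter does step through the results 100 at a time as these URLs suggest, a rough sketch of collecting everything for one agency could look like this (same assumptions as above):

import requests

BASE = "https://www.sbir.gov/api/awards.json"

def fetch_all_awards(agency):
    """Page through the awards endpoint using the 'start' offset, 100 at a time."""
    awards, start = [], 0
    while True:
        resp = requests.get(BASE, params={"agency": agency, "start": start})
        resp.raise_for_status()
        batch = resp.json()  # assumed to be a JSON list of awards
        if not batch:
            break
        awards.extend(batch)
        start += 100
    return awards

# e.g. fetch_all_awards("DOE")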

Date Range for Facebook Graph API request on posts level

I am working on a tool for my company created to get data from our Facebook publications. It has not been working for a while, so I have to get all the historical data from June to November 2018.
My two scripts (one that gets the title and type of publication, and the other that gets the number of link clicks) work well for getting data from recent posts, but when I try to add a date range to my Graph API request, I have some issues:
the regular query is [page_id]/posts?fields=id,created_time,link,type,name
the query for historical data is [page_id]/posts?fields=id,created_time,link,type,name,since=1529280000&until=1529712000, as the API is supposed to work with unixtime
I get perfect results for regular use, but the results for historical data only show video publications in Graph API Explorer, with a debug message saying:
The since field does not exist on the PagePost object.
Same for "until" field when not using "since". I tried to replace "posts/" with "feed/" but it returned the exact same result...
Do you have any idea of how to get all the publications from a Page I own on a certain date range?
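For reference, a rough sketch with Python's requests of how the date range might be sent as separate query parameters instead of inside the fields list (the page ID, access token, and API version are placeholders); whether this actually returns the full history is exactly what the answer below addresses:

import requests

GRAPH = "https://graph.facebook.com/v3.2"  # version is a placeholder
PAGE_ID = "YOUR_PAGE_ID"                   # placeholder
TOKEN = "YOUR_ACCESS_TOKEN"                # placeholder

resp = requests.get(
    GRAPH + "/" + PAGE_ID + "/posts",
    params={
        "fields": "id,created_time,link,type,name",
        "since": 1529280000,  # unix timestamps, as in the question
        "until": 1529712000,
        "access_token": TOKEN,
    },
)
posts = resp.json()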
So it seems it is not possible to request this kind of data; unfortunately, third-party services must be used...

Django - when best to calculate statistics on large amounts of data

I'm working on a Django application that consists of a scraper that scrapes thousands of store items (price, description, seller info) per day and a django-template frontend that allows the user to access the data and view various statistics.
For example: the user is able to click on 'Item A', and gets a detail view that lists various statistics about 'Item A' (Like linegraphs about price over time, a price distribution, etc)
The user is also able to click on reports of the individual 'scrapes' and get details about the number of items scraped, average price, etc.
All of these statistics are currently calculated in the view itself.
This all works well when working locally, on a small development database with +/- 100 items. However, in production this database will eventually consist of 1,000,000+ rows, which leads me to wonder whether calculating the statistics in the view won't lead to massive lag in the future. (Especially as I plan to extend the statistics with more complicated regression analysis, and perhaps some nearest-neighbour ML classification.)
The advantage of the view-based approach is that the graphs are always up to date. I could of course also schedule a cron job to make the calculations every few hours (perhaps even on a different server). This would make accessing the information very fast, but would also mean that the information could be a few hours old.
I've never really worked with data of this scale before, and was wondering what the best practises are.
As with anything performance-related, do some testing and profile your application. Don't get lured into the premature optimization trap.
That said, given the fact that these statistics don't change, you could perform them asynchronously each time you do a scrape. Like the scrape process itself, this calculation process should be done asynchronously, completely separate from your Django application. When the scrape happens it would write to the database directly and set some kind of status field to processing. Then kick off the calculation process which, when completed, will fill in the stats fields and set the status to complete. This way you can show your users how far along the processing chain they are.
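A rough sketch of that status-field approach inside a Django app; the model names and fields here are illustrative, not taken from the original project, and the calculation function would be invoked by the scraper process or a task queue, not by a view:

# models.py -- hypothetical models for illustration
from django.db import models

class Scrape(models.Model):
    STATUS_CHOICES = [("processing", "Processing"), ("complete", "Complete")]
    started_at = models.DateTimeField(auto_now_add=True)
    status = models.CharField(max_length=20, choices=STATUS_CHOICES, default="processing")
    # pre-computed stats, filled in once processing finishes
    item_count = models.IntegerField(null=True)
    average_price = models.DecimalField(max_digits=10, decimal_places=2, null=True)

class Item(models.Model):
    scrape = models.ForeignKey(Scrape, on_delete=models.CASCADE, related_name="items")
    price = models.DecimalField(max_digits=10, decimal_places=2)
    description = models.TextField()

# tasks.py -- run asynchronously after the scrape has written its rows
from django.db.models import Avg, Count

def compute_scrape_stats(scrape_id):
    scrape = Scrape.objects.get(pk=scrape_id)
    stats = scrape.items.aggregate(n=Count("id"), avg=Avg("price"))
    scrape.item_count = stats["n"]
    scrape.average_price = stats["avg"]
    scrape.status = "complete"
    scrape.save()

The views then read the stored fields instead of recomputing them on every request.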
People love feedback over immediate results and they'll tolerate considerable delays if they know they'll eventually get a result. Strand a user and they'll get frustrated more quickly than any computer can finish processing; lead them on a journey and they'll wait for ages to hear how the story ends.

github api - fetch all commits of all repos of an organisation based on date

Assembla provides a simple way to fetch all commits of an organisation using api.assembla.com/v1/activity.json, and it takes to and from parameters, allowing you to get commits for a selected date range from all the spaces (repos) the user is participating in.
Is there any similar way in GitHub?
I found these for Github:
/repos/:owner/:repo/commits
Accepts since and until parameters for getting commits in a selected date range. But since I want commits from all repos, I would have to loop over all those repos and fetch commits for each one (a sketch of this loop follows the endpoint list below).
/users/:user/events
This shows the commits of a user. I don't have any problem looping over all the users in the org, but how can I get events for a particular date?
/orgs/:org/events
This shows commits of all users across all repos, but I don't know how to fetch them for a particular date.
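As referenced in the first option above, a minimal sketch of looping over an organisation's repos and calling the commits endpoint with since/until, using requests (the org name and token are placeholders, and pagination of the repo list is ignored for brevity):

import requests

API = "https://api.github.com"
ORG = "your-org"                                  # placeholder
HEADERS = {"Authorization": "token YOUR_TOKEN"}   # placeholder token

def commits_in_range(since, until):
    """Collect commits from every repo in the org between two ISO 8601 timestamps."""
    repos = requests.get(API + "/orgs/" + ORG + "/repos", headers=HEADERS).json()
    commits = []
    for repo in repos:
        resp = requests.get(
            API + "/repos/" + ORG + "/" + repo["name"] + "/commits",
            headers=HEADERS,
            params={"since": since, "until": until},
        )
        if resp.status_code == 200:  # skip repos that error out (e.g. empty repositories)
            commits.extend(resp.json())
    return commits

# e.g. commits_in_range("2013-01-01T00:00:00Z", "2013-01-31T23:59:59Z")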
The problem with using the /users/:user/events endpoint is that you don't get just the PushEvents; you would have to skip over non-commit events and make more calls to the API. Assuming you're authenticated, you should be safe so long as your users aren't hyperactive.
For /orgs/:org/events I don't think they accept parameters for anything, but I can check with the API designers.
And just in case you aren't familiar, these are all paginated results, so you can go back to the beginning with the Link headers. My library (github3.py) provides iterators to do this for you automatically. You can also tell it how many events you'd like (same with commits, etc.). But yeah, I'll come back and edit after talking to the API guys at GitHub.
Edit: Conversation
You might want to check out the GitHub Archive project -- http://www.githubarchive.org/, and the ability to query the archive using Google's BigQuery. Sounds like it would be a perfect tool for the job -- I'm pretty sure you could get exactly what you want with a single query.
The other option is to call the GitHub API -- iterate over all events for the organization and filter out the ones that don't satisfy your date range criteria and event type criteria (commits). But since you can't specify date ranges in the API call, you will probably make a lot of calls to get the events that interest you. Notice that you don't have to iterate over every page starting from 0 to find the page that contains the first result in the date range -- just do a (variation of) binary search over page numbers to find any page that contains a commit in the date range, and then iterate in both directions until you break out of the date range. That should reduce the number of API calls you make.
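A rough sketch of that binary-search idea over the organisation's event pages (requests again; org and token are placeholders, events are assumed to come back newest-first, and note the events API only keeps a limited window of recent events):

import requests
from datetime import datetime, timezone

API = "https://api.github.com"
ORG = "your-org"                                  # placeholder
HEADERS = {"Authorization": "token YOUR_TOKEN"}   # placeholder

def page_events(page):
    resp = requests.get(API + "/orgs/" + ORG + "/events", headers=HEADERS, params={"page": page})
    resp.raise_for_status()
    return resp.json()

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def find_page_in_range(start, end, max_page=10):
    """Binary search for a page whose newest event falls inside [start, end]."""
    lo, hi = 1, max_page
    while lo <= hi:
        mid = (lo + hi) // 2
        events = page_events(mid)
        if not events:
            hi = mid - 1
            continue
        newest = parse(events[0]["created_at"])
        if newest > end:      # page is too recent, look further back in time
            lo = mid + 1
        elif newest < start:  # page is too old, look closer to the present
            hi = mid - 1
        else:
            return mid
    return None

# From the returned page, walk outwards in both directions, keeping only the
# PushEvents whose created_at falls inside the date range.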

Caching online prices fetched via API unless they change

I'm currently working on a site that makes several calls to big name online sellers like eBay and Amazon to fetch prices for certain items. The issue is, currently it takes a few seconds (as far as I can tell, this time is from making the calls) to load the results, which I'd like to be more instant (~10 seconds is too much in my opinion).
I've already cached other information that I need to fetch, but that information is static. Is there a way that I can cache the prices but update them only when needed? The code is in Python and I store the info in a MySQL database.
I was thinking of somehow using cron or something along those lines to update it every so often, but it would be nice if there were a simpler and less intense approach to this problem.
Thanks!
You can use memcache to do the caching. The first request will be slow, but the remaining requests should be instant. You'll want a cron job to keep it up to date though. More caching info here: Good examples of python-memcache (memcached) being used in Python?
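A small sketch of that with the python-memcached client (assuming a memcached server on localhost; the key naming and the fetch function are made up for illustration):

import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def get_price(item_id, fetch_from_api):
    """Return a cached price, falling back to the slow API call and caching the result."""
    key = "price:%s" % item_id
    price = mc.get(key)
    if price is None:
        price = fetch_from_api(item_id)   # the slow eBay/Amazon call
        mc.set(key, price, time=60 * 60)  # cache for an hour; a cron job can refresh it earlier
    return price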
Have you thought about displaying the cached data, then updating the prices via an ajax callback? You could notify the user if the price changed with a SO type notification bar or similar.
This way the users get results immediately, and updated prices when they are available.
Edit
You can use jQuery:
Assume you have a script named getPrices.php that returns a JSON array of items, each with an id and a price.
No error handling etc. here, just to give you an idea:
My necklace: <div id='price-1'> $123.50 </div><br>
My bracelet: <div id='price-2'> $13.50 </div><br>
...
<script>
$(document).ready(function() {
  $.ajax({
    url: "getPrices.php",
    dataType: "json",
    success: function(data) {
      // data is an array of objects like {id: 1, price: "$125.00"}
      for (var i = 0; i < data.length; i++) {
        $('#price-' + data[i].id).html(data[i].price);
      }
    }
  });
});
</script>
You need to handle the following in your application:
get the price
determine if the price has changed
cache the price information
For step 1, you need to consider how often the item prices will change. I would go with your instinct to set a Cron job for a process which will check for new prices on the items (or on sets of items) at set intervals. This is trivial at small scale, but when you have tens of thousands of items the architecture of this system will become very important.
To avoid delays in page load, try to get the information ahead of time as much as possible. I don't know your use case, but it'd be good to prefetch the information as much as possible and avoid having the user wait for an asynchronous JavaScript call to complete.
For step 2, if it's a new item or if the price has changed, update the information in your caching system.
For step 3, you need to determine the best caching system for your needs. As others have suggested, memcached is a possible solution. There are a variety of "NoSQL" databases you could check out, or even cache the results in MySQL.
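To make the three steps concrete, here is a small sketch that uses sqlite3 as a stand-in for the MySQL table (fetch_price is a placeholder for the eBay/Amazon API call); a cron job would call update_price for each tracked item:

import sqlite3

conn = sqlite3.connect("prices.db")
conn.execute("CREATE TABLE IF NOT EXISTS prices (item_id TEXT PRIMARY KEY, price REAL)")

def update_price(item_id, fetch_price):
    new_price = fetch_price(item_id)  # step 1: get the price
    row = conn.execute("SELECT price FROM prices WHERE item_id = ?", (item_id,)).fetchone()
    if row is None or row[0] != new_price:  # step 2: has it changed?
        conn.execute(
            "INSERT OR REPLACE INTO prices (item_id, price) VALUES (?, ?)",
            (item_id, new_price),
        )  # step 3: cache the new value
        conn.commit()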
How are you getting the price? If you are scraping the data from the normal HTML page using a tool such as BeautifulSoup, that may be slowing down the round-trip time. In this case, it might help to compute a fast checksum (such as MD5) of the page to see if it has changed before parsing it. If you are using an API which gives a short XML version of the price, this is probably not an issue.
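A quick sketch of that checksum idea using only the standard library (the URL fetch is illustrative; you would compare the digest against the one stored from the previous run):

import hashlib
import urllib.request

def page_checksum(url):
    """Fetch the page and return an MD5 digest of its raw bytes."""
    html = urllib.request.urlopen(url).read()
    return hashlib.md5(html).hexdigest()

# Only re-parse the page with BeautifulSoup (and update the cached price)
# when page_checksum(url) differs from the previously stored digest.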
