Caching online prices fetched via API unless they change - python

I'm currently working on a site that makes several calls to big-name online sellers like eBay and Amazon to fetch prices for certain items. The issue is that it currently takes a few seconds to load the results (as far as I can tell, most of that time is spent making the calls), and I'd like it to be closer to instant (~10 seconds is too much in my opinion).
I've already cached other information that I need to fetch, but that information is static. Is there a way I can cache the prices but update them only when needed? The code is in Python and I store the info in a MySQL database.
I was thinking of somehow using cron or something along those lines to update them every so often, but it would be nice if there were a simpler and less intensive approach to this problem.
Thanks!

You can use memcache to do the caching. The first request will be slow, but the remaining requests should be nearly instant. You'll want a cron job to keep it up to date, though. More caching info here: Good examples of python-memcache (memcached) being used in Python?
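A minimal sketch of that approach, assuming a local memcached instance and the python-memcache client; fetch_price_from_api, the key scheme, and the 15-minute TTL are placeholders to swap for your own code:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])
CACHE_TTL = 15 * 60  # seconds; tune to how stale a price you can tolerate

def get_price(item_id):
    key = 'price:%s' % item_id
    price = mc.get(key)
    if price is None:
        # Cache miss: make the slow seller API call, then cache the result.
        price = fetch_price_from_api(item_id)  # stand-in for your existing call
        mc.set(key, price, time=CACHE_TTL)
    return price

def warm_cache(item_ids):
    # Run this from a cron job so users rarely hit a cache miss.
    for item_id in item_ids:
        mc.set('price:%s' % item_id, fetch_price_from_api(item_id), time=CACHE_TTL)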

Have you thought about displaying the cached data, then updating the prices via an ajax callback? You could notify the user if the price changed with an SO-style notification bar or similar.
This way the users get results immediately, and updated prices when they are available.
Edit
You can use jQuery:
Assume you have a script named getPrices.php that returns a JSON array of objects, each containing an item's id and its price.
No error handling etc. here, just to give you an idea:
My necklace: <div id='item-1'> $123.50 </div><br>
My bracelet: <div id='item-2'> $13.50 </div><br>
...
<script>
$(document).ready(function() {
  $.ajax({
    url: "getPrices.php",
    dataType: "json",
    context: document.body,
    success: function(data) {
      // data is an array of {id: ..., price: ...} objects
      for (var i = 0; i < data.length; i++) {
        $('#item-' + data[i].id).html(data[i].price);
      }
    }
  });
});
</script>
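On the Python side, the endpoint behind that call could be a sketch like the one below (Flask is just an assumption here, and get_cached_price is a placeholder for your own cache or MySQL lookup):

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/getPrices')
def get_prices():
    item_ids = [1, 2]  # the ids rendered on the page
    # get_cached_price() is a stand-in for your own cache or MySQL lookup
    return jsonify([{'id': i, 'price': get_cached_price(i)} for i in item_ids])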

You need to handle the following in your application:
get the price
determine if the price has changed
cache the price information
For step 1, you need to consider how often the item prices will change. I would go with your instinct and set up a cron job for a process that checks for new prices on the items (or on sets of items) at set intervals. This is trivial at a small scale, but once you have tens of thousands of items the architecture of this system becomes very important.
To avoid delays in page load, try to get the information ahead of time as much as possible. I don't know your use case, but it's best to prefetch the information where you can rather than make the user wait for an asynchronous JavaScript call to complete.
For step 2, if it's a new item or if the price has changed, update the information in your caching system.
For step 3, you need to determine the best caching system for your needs. As others have suggested, memcached is a possible solution. There are a variety of "NoSQL" databases you could check out, or even cache the results in MySQL.
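A rough sketch of how those three steps could fit together inside the cron-driven process, with MySQL doing double duty as the cache; fetch_price_from_api, the connection details, and the item_prices table are assumptions for illustration:

import MySQLdb  # or whichever DB-API driver / ORM you already use

def refresh_prices(item_ids):
    db = MySQLdb.connect(host='localhost', user='app', passwd='...', db='shop')
    cur = db.cursor()
    for item_id in item_ids:
        new_price = fetch_price_from_api(item_id)               # step 1: get the price
        cur.execute("SELECT price FROM item_prices WHERE item_id = %s", (item_id,))
        row = cur.fetchone()
        if row is None or row[0] != new_price:                  # step 2: did it change?
            cur.execute(
                "REPLACE INTO item_prices (item_id, price) VALUES (%s, %s)",
                (item_id, new_price))                           # step 3: cache the price
    db.commit()
    db.close()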

How are you getting the price? If you are scraping the data from the normal HTML page using a tool such as BeautifulSoup, that may be slowing down the round trip. In this case, it might help to compute a fast checksum (such as MD5) of the page to see whether it has changed before parsing it. If you are using an API which gives a short XML version of the price, this is probably not an issue.
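For the checksum idea, a minimal sketch; requests and the in-memory dict are stand-ins for however you actually fetch pages and persist state, and note this only skips the parsing step, not the download itself:

import hashlib
import requests

page_checksums = {}  # url -> last seen MD5 digest (store this in your DB in practice)

def page_changed(url):
    html = requests.get(url).content
    digest = hashlib.md5(html).hexdigest()
    if page_checksums.get(url) == digest:
        return False   # identical page, skip the expensive BeautifulSoup parse
    page_checksums[url] = digest
    return True        # changed (or first fetch): reparse and update the price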

Related

Notion API quickly delete and repopulate entire DB

Background
I'm creating a Notion DB that will contain data about the different analyzers my team uses (analyzer name, location, last time the analyzer sent data, etc.). Since I'm using live data, I need a way to quickly update the data of all analyzers in the Notion DB.
I'm currently using a Python script to get the analyzers' data and upload it to the Notion DB. Currently I read each row and get its ID, which I then use to update that row's data - but this is too slow: it takes more than 30 seconds to update 100 rows.
The Question
I'd like to know if there's a way to quickly update the data of many rows (maybe in one big bulk operation). The goal is something like 100 row updates per second (instead of 30 seconds per 100 rows).
There are multiple things one could do here - sadly, none of them will improve the updates drastically. Currently there is no way to update multiple rows (or, to be more precise, pages) at once. I am not sure what "read each row" refers to, but you can retrieve up to 100 pages of a database in a single query - if you are retrieving them one by one, that could be improved.
Secondly, I'd like to know how often the analyzers change and, when they do, whether they are altered by the Python script or updated in Notion. If this does not happen too often, you might be able to cache the page_ids instead of retrieving the IDs on every update. Sadly, the last_edited_time of the database does not reflect additions or removals of its children, so simply checking that is not an option.
The third and last way to improve performance is multi-threading. You can send multiple requests at the same time, as the number of round trips is usually the bottleneck.
I know none of these will really help you, but sadly no efficient method to update multiple pages exists.
There is also the rate limit of an average of three requests per second, which Notion enforces to ensure fair performance for all users. If you send more requests, you will start receiving responses with an HTTP 429 code. Your integration should respect this response and hold off further requests for the number of seconds indicated, as per the Notion developer API guidelines on request limits.
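A sketch of the multi-threaded approach using plain requests against the pages endpoint; NOTION_TOKEN, the properties payload, and the worker count are placeholders, and the retry loop simply waits out 429 responses rather than pre-emptively throttling:

import time
from concurrent.futures import ThreadPoolExecutor
import requests

NOTION_TOKEN = "secret_..."  # placeholder for your integration token
HEADERS = {
    "Authorization": "Bearer " + NOTION_TOKEN,
    "Notion-Version": "2022-06-28",
    "Content-Type": "application/json",
}

def update_page(page_id, properties):
    url = "https://api.notion.com/v1/pages/" + page_id
    while True:
        resp = requests.patch(url, headers=HEADERS, json={"properties": properties})
        if resp.status_code != 429:
            return resp
        # Rate limited: wait as long as Notion asks, then retry.
        time.sleep(float(resp.headers.get("Retry-After", 1)))

def update_pages(updates):  # updates: iterable of (page_id, properties) pairs
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(lambda u: update_page(*u), updates))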

Dealing with old records in Scrapy?

I'm scraping data from a subsection of Amazon. I want to be able to detect when a product is no longer available if I have previously scraped that product. Is there a way to deal with outdated data like this?
The only solution I can think of so far is to completely purge the data and start the scraping over, but this would cause the metadata assigned to these items to be lost. The only other solution I can think of is an ad-hoc comparison of the two scrapes.
How are you storing the data after each run?
You might consider just checking for the existence of a buy button on subsequent scrapes and setting a flag on the item to mark it as unavailable.
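In a Scrapy callback that might look roughly like the sketch below; the CSS selector and the item fields are assumptions, since the real check depends on the markup of the pages you scrape:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"

    def parse_product(self, response):
        # Hypothetical selector - inspect the real page for the buy button element.
        has_buy_button = bool(response.css("#add-to-cart-button").get())
        yield {
            "url": response.url,
            "available": has_buy_button,  # flag the item instead of purging old data
        }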

Django - when best to calculate statistics on large amounts of data

I'm working on a Django application that consists of a scraper that scrapes thousands of store items (price, description, seller info) per day and a django-template frontend that allows the user to access the data and view various statistics.
For example: the user is able to click on 'Item A', and gets a detail view that lists various statistics about 'Item A' (Like linegraphs about price over time, a price distribution, etc)
The user is also able to click on reports of the individual 'scrapes' and get details about the number of items scraped, the average price, etc.
All of these statistics are currently calculated in the view itself.
This all works well locally, on a small development database with +/- 100 items. However, in production this database will eventually contain 1,000,000+ rows, which leads me to wonder whether calculating the statistics in the view won't lead to massive lag in the future. (Especially as I plan to extend the statistics with more complicated regression analysis, and perhaps some nearest-neighbour ML classification.)
The advantage of the view-based approach is that the graphs are always up to date. I could of course also schedule a cron job to run the calculations every few hours (perhaps even on a different server). This would make accessing the information very fast, but would also mean that the information could be a few hours old.
I've never really worked with data of this scale before, and was wondering what the best practises are.
As with anything performance-related, do some testing and profile your application. Don't get lured into the premature optimization trap.
That said, given the fact that these statistics don't change until the next scrape, you could compute them asynchronously each time you do a scrape. Like the scrape process itself, this calculation should be done asynchronously, completely separate from your Django application. When the scrape happens it would write to the database directly and set some kind of status field to "processing". Then kick off the calculation process which, when completed, fills in the stats fields and sets the status to "complete". This way you can show your users how far along the processing chain they are.
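A minimal sketch of that status-field idea on the Django side; the model, the field names, the related name "items", and the Celery task are assumptions about how you might wire it up, not anything from your code:

from celery import shared_task  # or any task queue / cron-driven script you prefer
from django.db import models

class ScrapeReport(models.Model):
    STATUS_CHOICES = [("processing", "Processing"), ("complete", "Complete")]
    started_at = models.DateTimeField(auto_now_add=True)
    status = models.CharField(max_length=20, choices=STATUS_CHOICES, default="processing")
    item_count = models.IntegerField(null=True)
    average_price = models.FloatField(null=True)

@shared_task
def compute_report_stats(report_id):
    report = ScrapeReport.objects.get(pk=report_id)
    items = report.items.all()  # assumes scraped items have a FK back to the report
    report.item_count = items.count()
    report.average_price = items.aggregate(models.Avg("price"))["price__avg"]
    report.status = "complete"
    report.save()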
People love feedback over immediate results, and they'll tolerate considerable delays if they know they'll eventually get a result. Strand a user and they'll get frustrated more quickly than any computer can finish processing; lead them on a journey and they'll wait for ages to hear how the story ends.

Storing queryset after fetching it once

I am new to django and web development.
I am building a website with a considerable size of database.
A large amount of data has to be shown on many pages, and a lot of it is repeated; I mean I need to show the same data on many pages.
Is it a good idea to query the database for the data on every GET request? It takes many seconds to get the data every time I refresh the page or request another page that shows the same data.
Is there a way to fetch the data once, store it somewhere, and just display it on every page, refetching it only when it has been updated?
I thought about the session, but I found that it is limited to 5MB, which is too small for my data.
Any suggestions?
Thank you.
Django's cache - as mentioned by Leistungsabfall - can help, but like most cache systems it has some drawbacks too if you use it naively for this kind of problem (long queries/computations): when the cache expires, the next request will have to recompute the whole thing - which might take some time, during which every new request will trigger a recomputation... Also, proper cache invalidation can be really tricky.
Actually there's no one-size-fits-all answer to your question; the right solution is often a mix of different solutions (code optimisation, caching, denormalisation etc), based on your actual data, how often it changes, how many visitors you have, how critical it is to have up-to-date data, etc. But the very first steps would be to:
check the code fetching the data and find out if there are possible optimisations at this level, using QuerySet features (.select_related() / .prefetch_related(), .values() and/or .values_list(), annotations etc) to avoid issues like the "n+1 queries" problem, fetching whole records and building whole model instances when you only need a single field's value, doing computations at the Python level when they could be done at the database level, etc (see the sketch after this list)
check your db schema's indexes - well-used indexes can vastly improve performance, badly used ones can vastly degrade it...
and of course use the right tools (db query logging, Python's profiler etc) to make sure you identify the real issues.
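As an illustration of the first point, a hedged sketch of the kind of QuerySet tuning plus per-data caching that often helps; Product, its fields, and the 10-minute timeout are made up for the example:

from django.core.cache import cache
from django.shortcuts import render
from myapp.models import Product  # hypothetical model

def product_list(request):
    rows = cache.get("product_rows")
    if rows is None:
        rows = list(
            Product.objects
            # only the needed fields; the join on category__name avoids n+1 queries
            .values("id", "name", "price", "category__name")
        )
        cache.set("product_rows", rows, timeout=600)  # assumed 10-minute TTL
    return render(request, "products/list.html", {"rows": rows})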

Live Updating Widget for 100+ concurrent users

What would you use if you had to have a div box on your website that constantly has to be updated with new HTML content from the server?
Simple polling is probably not very resource-efficient - imagine also having 10,000 users whose divs all have to update.
What is the most efficient or elegant solution for such a problem?
Are there existing widgets which provide this "autoupdate" functionality?
Consider using memcached. By caching content in memory you will reduce the number of calls to the (database?) server that generates the content.
To keep the content up to date, set an expiration time on the cached entry: a short expiration time will provide more up-to-date content, a long expiration time will provide better performance.
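A sketch of that expiration-based pattern with the python-memcache client; the key name, the 5-second TTL, and build_widget_html are illustrative assumptions:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def get_widget_html():
    html = mc.get('widget_html')
    if html is None:
        html = build_widget_html()           # the expensive DB/render step
        mc.set('widget_html', html, time=5)  # short TTL: fresher content, more recomputation
    return html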
