I'm scraping data from a subsection of Amazon. I want to be able to detect when a product is no longer available if I have previously scraped that product. Is there a way to deal with outdated data like this?
The only solution I can think of so far is to completely purge the data and start the scraping over, but this would cause the metadata assigned to these items to be lost. The only other solution I can think of is an ad-hoc comparison of the two scrapes.
How are you storing the data after each run?
You might consider just checking for the existence of a buy button on subsequent scrapes and marking a flag on the item as unavailable.
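Something along these lines, as a rough sketch (the selector, table, and column names are assumptions, so adjust them to your own setup):

import sqlite3

import requests
from bs4 import BeautifulSoup

def update_availability(conn, product_id, url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Treat the product as unavailable when no add-to-cart button is found.
    available = soup.select_one("#add-to-cart-button") is not None
    conn.execute(
        "UPDATE products SET available = ? WHERE id = ?",
        (1 if available else 0, product_id),
    )
    conn.commit()

# usage: update_availability(sqlite3.connect("products.db"), 42, product_url)

That way the existing row and its metadata stay put; you only flip the flag on each new scrape instead of purging anything.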
This question is specifically about where, from an architecture/design perspective, the best place is to parse text obtained from a response object in Scrapy.
Context:
I'm learning Python and starting with scraping data from a popular NFL football database site.
I've gotten all the data points I need and have them stored in a local database (SQLite).
One thing I am scraping is a 'play by play', which collects the things that happen in every play. There is a descriptive text field that may say things like "Player XYZ threw a pass to Player ABC" or "Player 123 ran the ball up the middle".
I'd like to take that text field, parse it, and categorize it into general groups such as "Passing Play", "Rushing Play" etc based off certain keyword patterns.
My question is as follows: when and where is the best place to do that? Do I create my own middleware in Scrapy so that, by the time the item reaches the pipeline, it already has the categories and is stored in my database with them? Or do I just collect the scraped responses 'raw', store them directly in my DB, and do the data cleaning in SQL after the fact, or even via a separate Python script?
As mentioned, I'm new to programming as a whole, so I'm not sure what's best from a 'design' perspective.
If you're doing any scraping in Scrapy, you will have to think about which item fields you want to use to collect the data, so figuring out what those fields are before you write your scraper is a good first step.
I don't necessarily think you need your own middleware unless your requests and responses specifically need work done on them. Middlewares are mostly useful for processing requests and responses rather than for data manipulation/cleaning, e.g. if you have duplicates, need to change responses, add requests, etc.
Scrapy is built for data extraction and already has a robust way of putting that information into a dictionary-like API called an ItemAdapter, which is essentially a wrapper around different ways of storing data.
There are also ways to clean data in both small and larger ways within Scrapy. You can use ItemLoaders, which put your items through small functions that can manipulate the data, or you can use a pipeline. Pipelines give you lots of flexibility in handling extracted data.
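For instance, a small sketch of the ItemLoader route for your play-by-play case (the item class, field names, and keyword rules below are my own illustration, not from your project):

import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

def categorize(description):
    # Very rough keyword rules just to show the idea.
    text = description.lower()
    if "pass" in text:
        return "Passing Play"
    if "ran" in text or "rush" in text:
        return "Rushing Play"
    return "Other"

class PlayItem(scrapy.Item):
    description = scrapy.Field()
    category = scrapy.Field()

class PlayLoader(ItemLoader):
    default_item_class = PlayItem
    default_output_processor = TakeFirst()
    # "category" is fed the same raw text as "description", but is run
    # through categorize() while the item is being loaded.
    category_in = MapCompose(categorize)

In the spider you would then add both fields from the same selector, e.g. loader.add_css("description", "td.detail::text") and loader.add_css("category", "td.detail::text"), and yield loader.load_item().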
You will also have to think about database design and which tables you're going to use, because ultimately that is where you will be putting your data. It's quite easy to set up a database pipeline in Scrapy, and such a pipeline is flexible enough to let you place data into any table you want using SQL queries.
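A rough sketch of such a pipeline with SQLite (table and field names are illustrative, and it assumes an item like the PlayItem above); you would enable it through the ITEM_PIPELINES setting:

import sqlite3

from itemadapter import ItemAdapter

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("nfl.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS plays (description TEXT, category TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # ItemAdapter gives uniform access whether the item is a dict or an Item.
        adapter = ItemAdapter(item)
        self.conn.execute(
            "INSERT INTO plays (description, category) VALUES (?, ?)",
            (adapter.get("description"), adapter.get("category")),
        )
        return item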
Familiarising yourself with the Scrapy architecture might help you create a mental model of the whole process.
I was wondering if I could set some conditions that have to be met for the information to be stored (doing web-scraping with Scrapy version 1.7.3).
For example, only storing the movies with a rating greater than 7 while scraping IMDB's website.
Or would I have to do it manually when looking through the output file? (I am currently outputting the data as a CSV file)
This is an interesting question, and yes, Scrapy can totally help you with this. There are a few approaches you can take. If it is only a matter of manipulating the items before actually "returning" them (at which point they are already output), I'd recommend using Item Loaders, which basically help you set up rules per field on each item.
For actually dropping items according to such rules, I'd suggest you use an Item Pipeline, which serves as a final filter before the items are returned. In this case it could also be interesting to combine it with something like Cerberus, which helps you define whole item schemas and drop or return an item according to them.
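As a minimal sketch of the pipeline route for your IMDB example (the "rating" field name is an assumption), something like this drops anything rated 7 or below; you enable it through the ITEM_PIPELINES setting:

from scrapy.exceptions import DropItem

class MinRatingPipeline:
    min_rating = 7.0

    def process_item(self, item, spider):
        rating = float(item.get("rating") or 0)  # "rating" field name is assumed
        if rating <= self.min_rating:
            raise DropItem("rating %s is not above the threshold" % rating)
        return item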
After one month of learning and toying around with Python, and doing tons and tons of exercises and samples, I am still not really able to answer one important question for myself: if I generate data, for example by loading it from XML or by scraping, how do I work with it? Right now I put each and every entry directly into an SQLite db.
For example:
I read a news feed and put the id, title, description, link, and tags into one row in my SQLite db. I do some categorizing according to keywords within the title or description. Each news entry goes into one row. After that I can read them row by row or all at once.
I can filter them, sort them... But somehow I feel like there is a better way: put them all, each as a dictionary, into one big list and work with that list somehow, and then, once I have worked through them, sorted them, or even derived more information from the data, put them into the SQLite db. But none of my books or tutorials talk about this topic!? To give an example of what I mean: is a news article more important if a certain keyword comes up multiple times, or together with another keyword? Or compare the news from that page with the news from another page with the same keywords.
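To make the idea concrete, here is roughly what I mean (the entries and keywords are made up):

import sqlite3

entries = [
    {"id": 1, "title": "Flood warning for the region", "description": "..."},
    {"id": 2, "title": "Flood damages old bridge", "description": "..."},
]
keywords = ["flood", "bridge"]

# work with the list in memory first: score, sort, filter...
for entry in entries:
    text = (entry["title"] + " " + entry["description"]).lower()
    entry["score"] = sum(text.count(kw) for kw in keywords)

entries.sort(key=lambda e: e["score"], reverse=True)
important = [e for e in entries if e["score"] > 0]

# ...and only put the result into the sqlite db at the end
conn = sqlite3.connect("news.db")
conn.execute("CREATE TABLE IF NOT EXISTS news (id INTEGER, title TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO news (id, title, score) VALUES (?, ?, ?)",
    [(e["id"], e["title"], e["score"]) for e in important],
)
conn.commit()
conn.close()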
I guess you will be laughing and say: he is talking about ..... how can he not know that. But I am new to all of this. Sooorrry.
Thank you for your help guiding me in the right direction.
I'm currently working on a site that makes several calls to big name online sellers like eBay and Amazon to fetch prices for certain items. The issue is, currently it takes a few seconds (as far as I can tell, this time is from making the calls) to load the results, which I'd like to be more instant (~10 seconds is too much in my opinion).
I've already cached other information that I need to fetch, but that information is static. Is there a way that I can cache the prices but update them only when needed? The code is in Python and I store info in a mySQL database.
I was thinking of somehow using cron or something along those lines to update it every so often, but it would be nice if there were a simpler and less intensive approach to this problem.
Thanks!
You can use memcache to do the caching. The first request will be slow, but the remaining requests should be instant. You'll want a cron job to keep it up to date though. More caching info here: Good examples of python-memcache (memcached) being used in Python?
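For example, a rough sketch with the python-memcached client (fetch_price_from_seller() here is just a placeholder for your existing eBay/Amazon lookup code):

import memcache

mc = memcache.Client(["127.0.0.1:11211"])
CACHE_SECONDS = 15 * 60  # how stale a price may get before it is refetched

def get_price(item_id):
    key = "price:%s" % item_id
    price = mc.get(key)
    if price is None:
        price = fetch_price_from_seller(item_id)  # the slow external call
        mc.set(key, price, time=CACHE_SECONDS)
    return price

A cron job can run the same fetch-and-set path for every item so most user requests hit the cache.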
Have you thought about displaying the cached data, then updating the prices via an ajax callback? You could notify the user if the price changed with a SO type notification bar or similar.
This way the users get results immediately, and updated prices when they are available.
Edit
You can use jquery:
Assume you have a script named getPrices.php that returns a JSON array with the id of each item and its price.
There's no error handling etc. here; it's just to give you an idea:
My necklace: <div id='item-1'> $123.50 </div><br>
My bracelet: <div id='item-2'> $13.50 </div><br>
...
<script>
$(document).ready(function() {
  $.ajax({ url: "getPrices.php", context: document.body, success: function(data) {
    // data is expected to be an array of {id: ..., price: ...} objects
    for (var i = 0; i < data.length; i++) {
      $('#item-' + data[i].id).html(data[i].price);
    }
  }});
});
</script>
You need to handle the following in your application:
get the price
determine if the price has changed
cache the price information
For step 1, you need to consider how often the item prices will change. I would go with your instinct to set a Cron job for a process which will check for new prices on the items (or on sets of items) at set intervals. This is trivial at small scale, but when you have tens of thousands of items the architecture of this system will become very important.
To avoid delays in page load, try to get the information ahead of time as much as possible. I don't know your use case, but it'd be good to prefetch the information as much as possible and avoid having the user wait for an asynchronous JavaScript call to complete.
For step 2, if it's a new item or if the price has changed, update the information in your caching system.
For step 3, you need to determine the best caching system for your needs. As others have suggested, memcached is a possible solution. There are a variety of "NoSQL" databases you could check out, or even cache the results in MySQL.
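Putting the three steps together, a cron-driven refresh could look roughly like this (the MySQL table, columns, credentials, and fetch_current_price() are placeholders for your own setup):

import MySQLdb

def refresh_prices():
    db = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="shop")
    cur = db.cursor()
    cur.execute("SELECT id, cached_price FROM items")
    for item_id, cached_price in cur.fetchall():
        current = fetch_current_price(item_id)   # step 1: get the price
        if current != cached_price:              # step 2: has it changed?
            cur.execute(                         # step 3: update the cached value
                "UPDATE items SET cached_price = %s, updated_at = NOW() WHERE id = %s",
                (current, item_id),
            )
    db.commit()
    db.close()

if __name__ == "__main__":
    refresh_prices()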
How are you getting the price? If you are scraping the data from the normal HTML page using a tool such as BeautifulSoup, that may be slowing down the round-trip time. In that case, it might help to compute a fast checksum (such as MD5) of the page to see whether it has changed before parsing it. If you are using an API that returns a short XML version of the price, this is probably not an issue.
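A small sketch of that checksum idea, assuming you store the previous hash alongside each item:

import hashlib

def page_changed(html, previous_md5):
    # Returns (changed, new_md5) so the caller can skip parsing and save the new hash.
    current_md5 = hashlib.md5(html.encode("utf-8")).hexdigest()
    return current_md5 != previous_md5, current_md5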
Additional questions regarding SilentGhost's initial answer to a problem I'm having parsing Twitter RSS feeds. See also partial code below.
First, could I insert tags[0], tags[1], etc., into the database, or is there a different/better way to do it?
Second, almost all of the entries have a url, but a few don't; likewise, many entries don't have the hashtags. So, would the thing to do be to create default values for url and tags? And if so, do you have any hints on how to do that? :)
Third, when you say the single-table db design is not optimal, do you mean I should create a separate table for tags? Right now, I have one table for the RSS feed urls and another table with all the RSS entry data (summary, date, etc.).
I've pasted in a modified version of the code you posted. I had some success in getting a "tinyurl" variable into the sqlite database, but now it isn't working. Not sure why.
Lastly, assuming I can get the whole thing up and running (smile), is there a central site where people might appreciate seeing my solution? Or should I just post something on my own blog?
Best,
Greg
I would suggest reading up on database normalisation, especially the 1st and 2nd normal forms. Once you're done with that, I hope there won't be a need for default values, and your db schema will evolve into something more appropriate.
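To give you an idea of where that might lead, here is a rough sketch of a normalised layout with tags in their own table, so there is no need for tags[0], tags[1] columns or default values (table and column names are only illustrative):

import sqlite3

conn = sqlite3.connect("tweets.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS entries (
        id INTEGER PRIMARY KEY,
        title TEXT,
        summary TEXT,
        url TEXT,          -- simply NULL when an entry has no url
        published TEXT
    );
    CREATE TABLE IF NOT EXISTS tags (
        entry_id INTEGER REFERENCES entries(id),
        tag TEXT
    );
""")

def save_entry(entry, tags):
    # entry is expected to be a dict-like feed entry; tags a list of strings.
    cur = conn.execute(
        "INSERT INTO entries (title, summary, url, published) VALUES (?, ?, ?, ?)",
        (entry.get("title"), entry.get("summary"), entry.get("link"), entry.get("published")),
    )
    conn.executemany(
        "INSERT INTO tags (entry_id, tag) VALUES (?, ?)",
        [(cur.lastrowid, tag) for tag in tags],
    )
    conn.commit()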
There are plenty of options for sharing your source code on the web; depending on which versioning system you're most comfortable with, you might have a look at such well-known sites as Google Code, Bitbucket, GitHub, and many others.