I was wondering if I could set some conditions that have to be met for the information to be stored (doing web-scraping with Scrapy version 1.7.3).
For example, only storing the movies with a rating greater than 7 while scraping IMDB's website.
Or would I have to do it manually when looking through the output file? (I am currently outputting the data as a CSV file)
This is an interesting question, and yes, Scrapy can totally help you with this. There are some approaches you can take. If it is only about manipulating the items before actually "returning" them (which means they are already an output), I'd recommend using Item Loaders, which basically let you set up rules per field on each item.
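As a rough sketch (the "rating" field name is an assumption, and the processor imports match Scrapy 1.7), an Item Loader could normalise the rating before it is stored:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

class MovieLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # strip whitespace and cast the scraped rating text to a float
    rating_in = MapCompose(str.strip, float)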
For actually dropping items according to such rules, I'd suggest you use an Item Pipeline, which serves as a final filter before the items are returned. In this case it could also be interesting to combine it with something like Cerberus, which lets you define whole item schemas and, based on those, drop or return an item.
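A minimal sketch of that pipeline idea (without Cerberus), assuming the item exposes a "rating" field:

from scrapy.exceptions import DropItem

class MinRatingPipeline(object):
    def process_item(self, item, spider):
        # keep only movies rated strictly higher than 7
        if float(item.get('rating', 0)) <= 7:
            raise DropItem("Rating too low: %r" % dict(item))
        return item

You would then enable it via ITEM_PIPELINES in settings.py, e.g. {'myproject.pipelines.MinRatingPipeline': 300} (the module path here is just a placeholder).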
This question is specifically about where, from an architecture/design perspective, the best place is to parse text obtained from a response object in Scrapy.
Context:
I'm learning Python and starting with scraping data from a popular NFL football database site
I've gotten all the data points I need, and have them stored in a local database (sqlite)
One thing I am scraping is a 'play by play', which collects the things that happen in every play. There is a descriptive text field that may say things like "Player XYZ threw a pass to Player ABC" or "Player 123 ran the ball up the middle".
I'd like to take that text field, parse it, and categorize it into general groups such as "Passing Play", "Rushing Play" etc based off certain keyword patterns.
My question is as follows: When and where is the best place to do that? Do I create my own middleware in Scrapy so that by the time it reaches the pipeline the item already has the categories and thus is stored in my database? Or do I just collect the scraped responses 'raw', store directly in my DB and do data cleaning in SQL after the fact, or even via a separate python script?
As mentioned, I'm new to programming as a whole, so I'm not sure what's best from a 'design' perspective.
If you're doing any scraping in Scrapy, you will have to think about which item fields you want to use to collect the data, so figuring out what those fields are before you write your scraper is a good first step.
I don't necessarily think you need your own middleware unless your data specifically needs work done at the request/response level. The middlewares are mostly useful for processing requests and responses rather than data manipulation/cleaning, e.g. if you have duplicates, need to change responses, or need to add requests.
Scrapy is built for data extraction and already has a robust way of putting that information into a dictionary-like API called ItemAdapter, which is essentially a wrapper for different ways of storing data.
There are also ways to clean data in small and larger ways within Scrapy. You can use Item Loaders, which put your items through small functions that can manipulate data, or use a pipeline. Pipelines give you lots of flexibility in handling extracted data.
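For your play-by-play categorisation, a pipeline could do the keyword matching before the item ever reaches the database. This is only a sketch: the field names ("description", "play_type") and the keyword lists are assumptions, and it expects the item to accept a "play_type" key (a plain dict, or an Item that declares that field).

class PlayCategoryPipeline(object):
    KEYWORDS = {
        'Passing Play': ('pass', 'threw', 'sacked'),
        'Rushing Play': ('ran', 'rush', 'up the middle'),
    }

    def process_item(self, item, spider):
        text = (item.get('description') or '').lower()
        # label the play with the first category whose keywords match
        item['play_type'] = 'Other'
        for category, words in self.KEYWORDS.items():
            if any(word in text for word in words):
                item['play_type'] = category
                break
        return item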
You will have to think about database design and what tables you're going to use, because ultimately that is where you will be putting your data. It's quite easy to set up a database pipeline in Scrapy, and the database pipeline is flexible enough for you to place data into any table you want using SQL queries.
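As an illustration only (database, table and column names are made up), a SQLite pipeline can be as small as:

import sqlite3

class SQLitePipeline(object):
    def open_spider(self, spider):
        self.conn = sqlite3.connect('plays.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS plays (description TEXT, play_type TEXT)')

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT INTO plays (description, play_type) VALUES (?, ?)',
            (item.get('description'), item.get('play_type')))
        return item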
Familiarising yourself with the Scrapy architecture might help you create a mental model of the process; the architecture overview in the Scrapy documentation walks through it.
I'm scraping data from a subsection of Amazon. I want to be able to detect when a product is no longer available if I have previously scraped that product. Is there a way to deal with outdated data like this?
The only solution I can think of so far is to completely purge the data and start the scraping over, but this will cause the metadata assigned to these items to be lost. The only other solution I can think of is an ad-hoc comparison of the two scraping runs.
How are you storing the data after each run?
You might consider just checking for the existence of a buy button on subsequent scrapes and marking a flag on the item as unavailable.
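A rough sketch of that idea inside a spider callback; the CSS selector and field names are assumptions, since Amazon's markup isn't shown here and changes often:

def parse_product(self, response):
    yield {
        'url': response.url,
        # treat a missing buy button as "no longer available"
        'available': bool(response.css('#add-to-cart-button').get()),
    }

On later runs you can compare this flag against what you already stored and mark previously seen products as unavailable, instead of purging everything and losing the metadata.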
After one month of learning and toying around with Python, and doing tons and tons of exercises and samples, I am still not really able to answer one important question for myself. If I generate data, for example by loading it from XML or from scraping, how do I work with it? Right now I put each and every entry directly into a SQLite DB.
For example:
I read a news feed and put id, title, description, link and tags into one row in my SQLite DB. I do some categorizing according to keywords within the title or description. Each news entry goes into one row. After that I can read them row by row or all at once.
I can filter them, sort them... But somehow I feel like there is a better way: put them all, as dictionaries, into one big list and work with that list, and only once I have worked through them, sorted them, or derived even more information from the data, put them into the SQLite DB. But none of my books or tutorials talk about this topic!? Let me try to give an example: is a news article more important if a certain keyword comes up multiple times, or together with another keyword? Or compare the news from that page with the news from another page with the same keywords.
I guess you will be laughing and say, "He is talking about ..... how can he not know that." But I am new to all of this. Sorry.
Thank you for your help guiding me in the right direction.
So I am working on a pet project where I'm storing various text files. I have set up my app to save the tags as a string in one of my collections, so an example would be:
tags: "Linux Apache WSGI"
Storing them and searching for them works just fine, but my question comes when I want to do something like a tag cloud, count all the various tags, or make a dynamic selection system based on tags: what is the best way to break them up to work with? Or should I be storing them some other way?
Logically I could scan through every record, get all the tags, break them up on spaces, then cache the result somehow. Maybe that's the right answer, but I wanted to ask the community's wisdom.
I'm using pymongo to interact with my database.
Or should I be storing them some other way?
The standard way to store tags is to store them as an array. In your case, the DB would look something like:
tags: ['linux', 'apache', 'wsgi']
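If your existing documents hold the space-separated string, converting on write is a one-liner; the database and collection names below are assumptions:

from pymongo import MongoClient

db = MongoClient().mydb
tag_string = "Linux Apache WSGI"
# store tags as a lower-cased array instead of one string
db.files.insert_one({'tags': [t.lower() for t in tag_string.split()]})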
... what is the best way to break them up to work with?
This is what Map/Reduce is designed for. This effectively "scans every record". The output of a Map/Reduce is another collection that you can query.
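A sketch of that "scan every record" step; note that on current MongoDB versions the aggregation pipeline ($unwind plus $group) is the usual stand-in for classic Map/Reduce, and this assumes a "files" collection whose documents have a "tags" array:

from pymongo import MongoClient

db = MongoClient().mydb
tag_counts = db.files.aggregate([
    {'$unwind': '$tags'},                                # one document per tag
    {'$group': {'_id': '$tags', 'count': {'$sum': 1}}},  # count each tag
    {'$sort': {'count': -1}},
])
for doc in tag_counts:
    print(doc['_id'], doc['count'])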
However, there's also another way to do this, and that's to keep "counters" and update them. So when you save a new document, you also increment the counters for all of the tags related to that document.
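A sketch of the counter approach with pymongo; the "tag_counts" collection name is an assumption:

from pymongo import MongoClient

db = MongoClient().mydb
for tag in ['linux', 'apache', 'wsgi']:
    db.tag_counts.update_one(
        {'_id': tag},
        {'$inc': {'count': 1}},
        upsert=True,  # create the counter the first time a tag is seen
    )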
So, basically what I'm trying to build is a hockey pool application, and there are a ton of ways I should be able to filter the data: for example, by free agent, goals, assists, position, etc.
I'm planning on doing this with a bunch of query strings, but I'm not sure what the best approach would be to pass these query strings along. Let's say I wanted to be on page 2 (as I'm using pagination to split the pages), sort by goals, and only show forwards; I would have the following query string:
?page=2&sort=g&position=f
But if I was on that page, and it was showing me all this corresponding info, and I were to click, say, points instead of goals, I would still want all my other filters intact, like this:
?page=2&sort=p&position=f
Since HTTP is stateless, I'm having trouble deciding what the best approach to this would be. If anyone has some good ideas they would be much appreciated, thanks ;)
Shawn J
Firstly, think about whether you really want to save all the parameters each time. In the example you give, you change the sort order but preserve the page number. Does this really make sense, considering you will now have different elements on that page? Even more, if you change the filters, the currently selected page number might not even exist.
Anyway, assuming that is what you want, you don't need to worry about state or cookies or any of that, seeing as all the information you need is already in the GET parameters. All you need to do is to replace one of these parameters as required, then re-encode the string. Easy to do in a template tag, since GET parameters are stored as a QueryDict which is basically just a dictionary.
Something like (untested):
@register.simple_tag
def url_with_changed_parameter(request, param, value):
    # request.GET is an immutable QueryDict, so work on a mutable copy
    params = request.GET.copy()
    params[param] = value
    return "%s?%s" % (request.path, params.urlencode())
and you would use it in your template:
{% url_with_changed_parameter request "page" 2 %}
Have you looked at django-filter? It's really awesome.
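For reference, a minimal django-filter sketch; the Player model and its fields are invented for illustration:

import django_filters
from myapp.models import Player  # hypothetical model

class PlayerFilter(django_filters.FilterSet):
    class Meta:
        model = Player
        fields = ['position', 'free_agent']

# in a view: f = PlayerFilter(request.GET, queryset=Player.objects.all())
# then render f.qs; the filter carries the current GET parameters for you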
Check out the filter mechanism in the admin application; it includes dealing with dynamically constructed URLs with filter information supplied in the query string.
In addition, consider saving actual state information in cookies/sessions.
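For example, a Django view could stash the current query string in the session and restore it when the user returns without parameters; the session key, view and template names here are just placeholders:

from django.shortcuts import redirect, render

def player_list(request):
    if request.GET:
        # remember the filters this user last applied
        request.session['player_filters'] = request.GET.urlencode()
    elif request.session.get('player_filters'):
        # no parameters given: send them back to their last filter set
        return redirect('%s?%s' % (request.path, request.session['player_filters']))
    return render(request, 'players/list.html', {})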
If you want to save all the "parameters", I'd say they are resource identifiers and should normally be part of the URI.