After one month of learning and toying around with python and doing tons and tons of exercieses and samples I am still not really able to answer one important question to myself. If I generate data, for example with loading them from a xml or from scraping. How do I work with them? Right now I put each and every entrie directly into a sqlite db.
For exmaple:
I read a news feed, I put id, title, description, link, tags into one row in my sqlite db. I do some categorizing according to keywords within the the title or description. Each news entry goes into one row. After that I can read them row by row or at once.
I can filter them, sort them... But somehow I feel like there is a better way. Put them all with a dictionary into one big list and somehow work with this list. And then after I worked thru them sorted them or got even more infos with that data. Put them into the sqlite db. But none of my books or tutorial talk about this topic!? I try to make an example is a news article more important if a certain keyword comes up multiple times or in with an other keyword. Compare the news from that page with the news from another page with same keywords.
I guess u will be laughing and say. He is talking about ..... how can he not know that. But I am new to all of that. Sooorrry.
Thank you for your help guiding me in the right direction.
Related
I have a database of possibly infinitely many products.
Every single product has a Title and tags representing it.
I am trying to make a system where somebody can search for the product using the Title primarily, but helping to sort out the Top 100 products using tags.
Some people have suggested Django (Taggit) or just forming an API.
I am not sure how relevant that is and how needed it is for something simple like the task mentioned above.
I am also looking for something efficient.
Coding (python) newbie here.I am trying to make a small project where I gather news headlines from an API.The API I am trying to use is newsapi.org. Every time I enquire the API, It gives me a a long list of news articles,which I add to a database . But when I enquire again after some time gap , I get a new article(S) plus many, if not all the previous articles. How do I avoid adding duplicates to the database? I thought of iterating through the last few new entries and see if such articles already exist but somehow I feel that is not the way.Any help is appreciated.
Thank you
The API you want to use includes a from parameter that you can set to filter the articles returned and only get the ones that are more recent.
If you store the current time each time you gather data from the api, you can use it in the from parameter on the next api call you will do to only get what has been published in between the two calls.
This question is specifically on, from a architecture/design perspective, where is the best place to parse text obtained from a response object in Scrapy?
Context:
I'm learning Python and starting with scraping data from a popular NFL football database site
I've gotten all the data points I need, and have them stored in a local database (sqlite)
One thing I am scraping is a 'play by play', which collects the things that happen in every play. There is a descriptive text field that may say things like "Player XYZ threw a pass to Player ABC" or "Player 123 ran the ball up the middle".
I'd like to take that text field, parse it, and categorize it into general groups such as "Passing Play", "Rushing Play" etc based off certain keyword patterns.
My question is as follows: When and where is the best place to do that? Do I create my own middleware in Scrapy so that by the time it reaches the pipeline the item already has the categories and thus is stored in my database? Or do I just collect the scraped responses 'raw', store directly in my DB and do data cleaning in SQL after the fact, or even via a separate python script?
As mentioned, new to programming as a whole so I'm not sure what's best from a 'design' perspective
If you doing any scraping in scrapy, you will have to think about which item fields you want to use to collect the data. So figuring out what those fields are before you write your scraper is a first.
I don't necessarily think you need your own middleware unless your data is particularly needing work done in terms of requests and responses. The middlewares are mostly useful for processing reqeuest and responses rather than data manipulation/cleaning. i.e if you have duplicates or needing to change responses or add requests etc...
Scrapy is built for data extraction and has already a robust way of putting that information into a Dictionary-like API called an ItemsAdapter, which is essentially a wrapper for different ways of storing data.
There also ways to clean data in small ways and in larger ways within Scrapy. You can ItemLoaders which puts your items through a small function that can manipulate data or use a pipeline. Pipelines give you lots of flexibility in handling extracted data.
You will have to think about database design and what tables you're going to use, because ultimately that is where you will putting your data. It's quite easy to setup a Database Pipeline in Scrapy. The database pipeline is flexibly enough for you place data into any table you want using SQL queries.
Familiarising yourself with the Scrapy Architecture here might help you create a mental model of the process. You can see this here
I was wondering if I could set some conditions that have to be met for the information to be stored (doing web-scraping with Scrapy version 1.7.3).
For example, only storing the movies with a rating greater than 7 while scraping IMDB's website.
Or would I have to do it manually when looking through the output file? (I am currently outputting the data as a CSV file)
This is an interesting question, and yes, scrapy can totally help you with this. There some approaches you can take, if it is only for manipulating the items before actually "returning" them (which means it is already an output) maybe I'd recommend to use Item Loaders which basically helps you setup rules per field on each item.
For actually dropping items with the respective rules I'd suggest you to use and Item Pipeline which serves as a final filter before again returning the items, in this case it would be interesting for you to combine it with something like Cerberus which helps you define whole item schemas and according to that, drop or return an item.
Additional questions regarding SilentGhost's initial answer to a problem I'm having parsing Twitter RSS feeds. See also partial code below.
First, could I insert tags[0], tags[1], etc., into the database, or is there a different/better way to do it?
Second, almost all of the entries have a url, but a few don't; likewise, many entries don't have the hashtags. So, would the thing to do be to create default values for url and tags? And if so, do you have any hints on how to do that? :)
Third, when you say the single-table db design is not optimal, do you mean I should create a separate table for tags? Right now, I have one table for the RSS feed urls and another table with all the rss entry data (summar.y, date, etc.).
I've pasted in a modified version of the code you posted. I had some success in getting a "tinyurl" variable to get into the sqlite database, but now it isn't working. Not sure why.
Lastly, assuming I can get the whole thing up and running (smile), is there a central site where people might appreciate seeing my solution? Or should I just post something on my own blog?
Best,
Greg
I would suggest reading up on database normalisation, especially on 1st and 2nd normal forms. Once you're done with it, I hope there won't be need for default values, and your db schema evolves into something more appropriate.
There are plenty of options for sharing your source code on the web, depending on what versioning system you're most comfortable with you might have a look at such well know sites as google code, bitbucket, github and many other.