RSS aggregation packages - python

We are looking to add a news/articles section to an existing site, which will be powered by aggregating content via RSS feeds. The requirements are:
Be able to aggregate lots of feeds. Initially we will start with a small number, and eventually we may be aggregating a few hundred of them.
We don't want to display the whole post on our site. We will display a summary or short description, and when the user clicks "read more", they will be taken to the original post on the external site.
We would like to grab the image(s) related to a post and display them as a small thumbnail with the post on our site.
Create an automated tag cloud out of all the aggregated content.
Categorize aggregated content using a category/sub-category structure.
The aggregation piece should perform well.
Our web app is built using Django, so I am looking into selecting one of the following packages. Based on our requirements, which package would you recommend?
django-planet
django-news
planetplanet
feedjack

If you have a good idea of what you want, why not just try them all? If you have pretty strict requirements, write it yourself: roll your own aggregator with feedparser.
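For what it's worth, a minimal sketch of what a hand-rolled aggregator built on feedparser could look like; the feed URLs are placeholders, and the thumbnail logic is just one common convention (media:thumbnail or image enclosures), not something any of the packages above guarantee:

```python
import feedparser

# Hypothetical feed URLs; replace with your own list (eventually a few hundred).
FEEDS = ["https://example.com/rss", "https://example.org/atom.xml"]

def aggregate(feed_urls):
    """Collect summary, link, tags and a thumbnail (if any) for each entry."""
    posts = []
    for url in feed_urls:
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            post = {
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),        # "read more" points here
                "summary": entry.get("summary", ""),  # short description only
                "tags": [t.term for t in entry.get("tags", [])],  # feeds the tag cloud
                "thumbnail": None,
            }
            # Many feeds expose images via media:thumbnail or image enclosures.
            if "media_thumbnail" in entry:
                post["thumbnail"] = entry.media_thumbnail[0].get("url")
            else:
                for enc in entry.get("enclosures", []):
                    if enc.get("type", "").startswith("image/"):
                        post["thumbnail"] = enc.get("href")
                        break
            posts.append(post)
    return posts

if __name__ == "__main__":
    for post in aggregate(FEEDS):
        print(post["title"], "->", post["thumbnail"])
```

From there, storing the posts as Django models and generating the tag cloud and categories is regular Django work.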

Related

How can I export data from Zendesk?

My org is utilizing Zendesk for work orders. To do this, we have created custom fields to manage statuses and various other information. I want to be able to export this data for reporting purposes to see what is completed, what is in progress, etc., but the 10-column limitation in Zendesk is an issue. Can I use the API to export these work order tickets with a column for each custom field and get it into CSV?
Can you be more specific about which Zendesk product you are using? They have lots of different products (Zendesk Support, Zendesk Talk, Zendesk Chat, etc.).
Zendesk Views do currently have a 10-column limitation, but Zendesk provides multiple API endpoints to export data from its products.
You can do the following:
Export with the time-based incremental export API. Documentation link
Export with the query-based Search endpoint. Documentation link. There is also a Zendesk community post, "Support Search Reference", on how to create a search query. Do take a look at that!
IMO you probably want to go with route 2.
There's also an Advanced Search marketplace app which is pretty handy and easy to use!
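For what it's worth, a rough Python sketch of route 2; the subdomain, credentials, search query and custom field IDs below are placeholders you would have to replace, and pagination of large result sets is left out:

```python
import csv
import requests

SUBDOMAIN = "yourcompany"                         # placeholder Zendesk subdomain
AUTH = ("agent@example.com/token", "API_TOKEN")   # placeholder email/token credentials
QUERY = "type:ticket tags:work_order"             # placeholder search query
CUSTOM_FIELD_IDS = [360001234567, 360007654321]   # placeholder custom field IDs

# Query-based Search endpoint (route 2).
url = f"https://{SUBDOMAIN}.zendesk.com/api/v2/search.json"
resp = requests.get(url, params={"query": QUERY}, auth=AUTH)
resp.raise_for_status()
tickets = resp.json()["results"]

with open("work_orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "subject", "status"] + [str(i) for i in CUSTOM_FIELD_IDS])
    for t in tickets:
        # Each ticket carries its custom fields as a list of {"id": ..., "value": ...}.
        custom = {cf["id"]: cf["value"] for cf in t.get("custom_fields", [])}
        writer.writerow([t["id"], t["subject"], t["status"]]
                        + [custom.get(i, "") for i in CUSTOM_FIELD_IDS])
```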

Design Question: Best 'place' to parse Scrapy text in Items?

This question is specifically about where, from an architecture/design perspective, the best place is to parse text obtained from a response object in Scrapy.
Context:
I'm learning Python and starting out by scraping data from a popular NFL football database site.
I've gotten all the data points I need, and have them stored in a local database (SQLite).
One thing I am scraping is a 'play by play', which collects the things that happen in every play. There is a descriptive text field that may say things like "Player XYZ threw a pass to Player ABC" or "Player 123 ran the ball up the middle".
I'd like to take that text field, parse it, and categorize it into general groups such as "Passing Play", "Rushing Play" etc based off certain keyword patterns.
My question is as follows: When and where is the best place to do that? Do I create my own middleware in Scrapy so that by the time it reaches the pipeline the item already has the categories and thus is stored in my database? Or do I just collect the scraped responses 'raw', store directly in my DB and do data cleaning in SQL after the fact, or even via a separate python script?
As mentioned, I'm new to programming as a whole, so I'm not sure what's best from a 'design' perspective.
If you're doing any scraping in Scrapy, you will have to think about which item fields you want to use to collect the data, so figuring out what those fields are before you write your scraper comes first.
I don't necessarily think you need your own middleware unless your data particularly needs work done at the request/response level. The middlewares are mostly useful for processing requests and responses rather than data manipulation/cleaning, i.e. if you have duplicates, or need to change responses or add requests, etc.
Scrapy is built for data extraction and already has a robust way of putting that information into a dictionary-like API called ItemAdapter, which is essentially a wrapper for different ways of storing data.
There are also ways to clean data, both in small ways and in larger ways, within Scrapy. You can use ItemLoaders, which run your items through small functions that can manipulate data, or use a pipeline. Pipelines give you lots of flexibility in handling extracted data.
You will have to think about database design and what tables you're going to use, because ultimately that is where you will be putting your data. It's quite easy to set up a database pipeline in Scrapy, and it is flexible enough to let you place data into any table you want using SQL queries.
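Not your exact setup, but a minimal sketch of such a database pipeline, assuming a hypothetical description field on the item (plus a category field, or a plain dict item) and a plays table; the keyword lists are placeholders:

```python
import sqlite3
from itemadapter import ItemAdapter

class PlayCategoryPipeline:
    """Categorize the play-by-play description, then store the item in SQLite."""

    # Placeholder keyword patterns; extend to match the site's phrasing.
    KEYWORDS = {
        "Passing Play": ("pass", "threw", "sacked", "interception"),
        "Rushing Play": ("rush", "ran the ball", "up the middle"),
    }

    def open_spider(self, spider):
        self.conn = sqlite3.connect("plays.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS plays (description TEXT, category TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def categorize(self, text):
        text = text.lower()
        for category, words in self.KEYWORDS.items():
            if any(w in text for w in words):
                return category
        return "Other"

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # Assumes the item declares 'description' and 'category' fields (or is a dict).
        adapter["category"] = self.categorize(adapter.get("description") or "")
        self.conn.execute(
            "INSERT INTO plays (description, category) VALUES (?, ?)",
            (adapter.get("description"), adapter["category"]),
        )
        return item
```

You would switch it on through the ITEM_PIPELINES setting in settings.py; this way the categorization logic lives in one place instead of leaking into the spider or into after-the-fact SQL.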
Familiarising yourself with the Scrapy architecture (see the architecture overview in the Scrapy docs) might help you create a mental model of the process.

How to get all Wikipedia articles from category and sub categories using Python? [duplicate]

I want to get the names of all the articles under a category and its sub-categories.
Options I'm aware of:
Using the Wikipedia API. Does it have such an option?
Downloading the dump. Which format would be better for my usage?
There is also an option to search Wikipedia with something like incategory:"music", but I didn't see an option to view those results as XML.
Please share your thoughts.
The following resource will help you to download all pages from the category and all its subcategories:
http://en.wikipedia.org/wiki/Wikipedia:CatScan
There is also an API available here:
https://www.mediawiki.org/wiki/API:Categorymembers
You can do this through the following two API methods:
For article pages in this category:
YOUR_URL/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Music
For getting subcategories:
YOUR_URL/api.php?action=query&format=json&list=categorymembers&cmtype=subcat&cmtitle=Category:Music
You can get more info in the MediaWiki API documentation.
Note that Wikipedia's categorization system is not a tree, or even an acyclic graph. It is quite possible that by continually following subcategory links you will eventually wind up back where you started.
If you are going to be making many such queries, you would be best served by downloading a database dump. If this will be an infrequent thing and will only be dealing with small categories, you could probably get away with making repeated queries to list=categorymembers.
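If you do go the repeated-queries route, a rough sketch along these lines (using the endpoints above) keeps a visited set so the cycles mentioned earlier don't turn into infinite recursion:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category, cmtype):
    """Yield members of a category, following API continuation."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": cmtype,     # "page" for articles, "subcat" for subcategories
        "cmlimit": "500",
    }
    while True:
        data = requests.get(API, params=params).json()
        yield from data["query"]["categorymembers"]
        if "continue" not in data:
            break
        params.update(data["continue"])

def all_articles(category, seen=None):
    """Collect article titles in a category and all of its subcategories."""
    seen = set() if seen is None else seen
    if category in seen:      # the category graph can contain cycles
        return set()
    seen.add(category)
    titles = {m["title"] for m in category_members(category, "page")}
    for sub in category_members(category, "subcat"):
        titles |= all_articles(sub["title"], seen)
    return titles

print(len(all_articles("Category:Music")))
```

Be warned that for a big category like Category:Music this fires a lot of requests, which is exactly why the dump is the better option for large-scale work.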
incategory:"music" does not appear to do subcategory searching.

Extract statistical information from Wikipedia article

I'm currently extracting data from DBpedia articles using a SPARQLWrapper for python, but I can't seem to find how to extract the number of watchers (and other statistical information) for a given article.
Is there an easy way to achieve this? I don't mind if it's through DBpedia, or directly through wikipedia (using wget, for example).
Thanks for any advice.
Getting the number of watchers for an arbitrary article is deliberately restricted, as it is considered a security risk if everyone could find unwatched pages. For example, only privileged users have access to Special:UnwatchedPages. There is a toolserver tool (which has access to the DB) showing the number of watchers, but it is restricted to pages with more than 30 watchers for the same reason, at least for unauthenticated users.
The MediaWiki query API mostly exposes content and status information about articles, though you can also query and evaluate the public logs or revision histories to get statistical data about (public) user actions. For more stats about the Wikimedia sites you may have a look at Meta:Statistics, where various data sources (mostly http://stats.wikimedia.org/) and visualisations of them are listed.
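As an illustration of the kind of public data the query API will hand you (here the recent revision history of one article, which you could aggregate into edit statistics yourself; the article title is just an example):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

# Fetch the 50 most recent revisions of an article (public data only).
params = {
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "titles": "Python (programming language)",
    "rvprop": "timestamp|user|size",
    "rvlimit": "50",
}
data = requests.get(API, params=params).json()
page = next(iter(data["query"]["pages"].values()))
for rev in page["revisions"]:
    print(rev["timestamp"], rev["user"], rev["size"])
```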

Efficient storage of and access to web pages with Python

So like many people I want a way to download, index/extract information from, and store web pages efficiently. My first thought is to use MySQL and simply shove the pages in, which would let me use FULLTEXT searches and thus do ad hoc queries easily (in case I want to see if something exists and extract it, etc.). But of course, performance-wise I have some concerns, especially with large objects/pages and high volumes of data. So that leads me to look at things like CouchDB, search engines, etc. To summarize, my basic requirements are:
It must be Python compatible (libraries/etc.)
Store metadata (URL, time retrieved, any GET/POST stuff I sent, response code, etc.) of the page I requested.
Store a copy of the original web page as sent by the server (might be content, might be 404 search page, etc.).
Extract information from the web page and store it in a database.
Have the ability to do ad hoc queries on the existing corpus of original web pages (for example, a new type of information I want to extract, or to see how many of the pages have the string "fizzbuzz" or whatever in them).
And of course it must be open source/Linux compatible, I have no interest in something I can't modify or fiddle with.
So I'm thinking several broad options are:
Toss everything into MySQL, use FULLTEXT, go nuts, shard the content if needed.
Toss meta data into MySQL, store the data on the file system or something like CouchDB, write some custom search stuff.
Toss metadata into MySQL, store the data on a file system with a web server (maybe /YYYY/MM/DD/HH/MM/SS/URL/), make sure there is no default index.html/etc. specified (directory index each directory, in other words), and use some search engine like Lucene or Sphinx to index the content and use that to search. The biggest downside I see here is the inefficiency of repeatedly crawling the site.
Other solutions?
When answering please include links to any technologies you mention and if possible what programming languages it has libraries for (i.e. if it's Scala only or whatever it's probably not that useful since this is a Python project). If this question has already been asked (I'm sure it must have been) please let me know (I searched, no luck).
Why do you think solution (3), the Sphinx-based one, requires "repeatedly crawling the site"? Sphinx can accept and index many different data sources, including MySQL and PostgreSQL "natively" (there are contributed add-ons for other DBs such as Firebird) -- you can keep your HTML docs as columns in your DB if you like (modern PostgreSQL versions should have no trouble with that, and I imagine MySQL wouldn't either), and just use Sphinx's superior indexing and full-text search (including stemming &c). Your metadata all comes from headers, after all (plus the HTTP request body if you want to track requests in which you POSTed data, but not the HTTP response body at any rate).
One important practical consideration: I would recommend standardizing on UTF-8 -- HTML will come to you in all sorts of weird encodings, but there's no need to go crazy supporting that at search time -- just transcode every text page to UTF-8 upon arrival (from whatever funky encoding it came in), before storing and indexing it, and live happily ever after.
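A minimal sketch of that transcode-on-arrival step, assuming pages are fetched with requests (the encoding detection here is deliberately simple; messy real-world HTML may also need charset sniffing from meta tags):

```python
import requests

def fetch_as_utf8(url):
    """Download a page and return its body re-encoded as UTF-8 bytes."""
    resp = requests.get(url)
    # requests guesses the encoding from the HTTP headers; fall back to its
    # content-based detection when the header is missing or unreliable.
    encoding = resp.encoding or resp.apparent_encoding
    text = resp.content.decode(encoding, errors="replace")
    return text.encode("utf-8")  # store and index this, whatever encoding the server used
```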
Maybe you could special-case non-textual responses to keep those in files (I can imagine that devoting gigabytes in the DB to storing e.g. videos which can't be body-indexed anyway might not be a good use of resources).
And BTW, Sphinx does come with Python bindings, as you request.
You may be trying to achieve too much with the storage of the HTML (and supporting files). It seems you wish this repository would both:
allow you to display a particular page as it was on its original site
provide indexing for locating pages relevant to particular search criteria
The html underlying a web page once looked a bit akin to a self-standing document, but the pages crawled off the net nowadays are much messier: javascript, ajax snippets, advertisement sections, image blocks etc.
This reality may cause you to rethink the one storage for all html approach. (And also the parsing / pre-processing of the material crawled, but that's another story...)
On the other hand, the distinction between metadata and the true text content associated with the page doesn't need to be so marked. By "true text content", I mean [possibly partially marked-up] text from the web pages that is otherwise free of all other "Web 2.0 noise". Many search engines, including Solr (since you mentioned Lucene), now allow mixing the two genres in the form of semi-structured data. For operational purposes (e.g. to task the crawlers etc.), you may keep a relational store with management-related metadata, but the idea is that for search purposes, fielded and free-text info can coexist nicely (at the cost of pre-processing much of the input data).
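To make the semi-structured idea concrete, here is a rough sketch of indexing one page into Solr over its JSON update endpoint; the core name and field names are assumptions that would have to match your schema:

```python
import requests

# Hypothetical Solr core named "pages"; commit immediately for simplicity.
SOLR_UPDATE = "http://localhost:8983/solr/pages/update?commit=true"

# One semi-structured document: management metadata as fields, page text as free text.
doc = {
    "id": "http://example.com/some/page",
    "url": "http://example.com/some/page",
    "fetched_at": "2010-01-01T00:00:00Z",
    "http_status": 200,
    "content": "the extracted, noise-free text of the page ...",
}

resp = requests.post(SOLR_UPDATE, json=[doc])
resp.raise_for_status()
```

Fielded queries (fetched_at ranges, status codes) and free-text queries over content then run against the same index.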
It sounds to me like you need a content management system. Check out Plone. If that's not what you want maybe a web framework, like Grok, BFG, Django, Turbogears, or anything on this list. If that isn't good either, then I don't know what you are asking. :-)
