Reading RSS Feeds: What Aggregators Do That I'm Not - python

I drop the following feed into Google Reader, and it updates normally.
http://www.indeed.ca/rss?q=&l=Hamilton%2C+ON
However, when I use any of a number of approaches suggested thither and yon on the 'net that simply involve reading from this source and parsing the XML, I receive the same 20 items.
What is Google Reader doing that I should be doing in my code so that I receive new items?
Thanks for your advice. Incidentally, I'm coding in Python.

RSS aggregators "poll" the sources, i.e., they repeat the HTTP query periodically on each source, and check if anything new appears in the results. That's unfortunate, as polling always is, as it wastes resources in an unending series of "are we there yet?" questions (kind of like taking a toddler along in a long car drive;-), and nevertheless implies delays (if you poll a given source every hour, say, you'll wait up to an hour to see some results).
Unfortunately, in the RSS architecture itself, there are no alternatives, no way to ask for a "callback" when new stuff appears or opt for a saner "publish-subscribe architecture".
A good effort to remedy that is pubsubhubbub, but it inevitably requires cooperation (above and beyond the RSS standards) from RSS sources and aggregators -- so it needs very wide takeup before it can be called "a solution" to the problem, though, technically, it already is (for cooperating sites;-).
So back to your question, you're doing nothing wrong: you just need to poll periodically, like RSS aggregators do, in order to get to see new results eventually.
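For what it's worth, a minimal polling sketch in Python, assuming the third-party feedparser library (the answer doesn't require it, it's just a common choice): it remembers which entry IDs it has already seen and hands the ETag/Last-Modified values back to the server so an unchanged feed can be answered with 304.

    import time
    import feedparser  # third-party: pip install feedparser

    FEED_URL = "http://www.indeed.ca/rss?q=&l=Hamilton%2C+ON"

    def poll_forever(interval_seconds=3600):
        seen_ids = set()
        etag = modified = None
        while True:
            # Re-issue the HTTP query; etag/modified let the server reply "304 Not Modified".
            feed = feedparser.parse(FEED_URL, etag=etag, modified=modified)
            etag = getattr(feed, "etag", etag)
            modified = getattr(feed, "modified", modified)
            for entry in feed.entries:
                entry_id = entry.get("id") or entry.get("link")
                if entry_id and entry_id not in seen_ids:
                    seen_ids.add(entry_id)
                    print("new item:", entry.get("title"))
            time.sleep(interval_seconds)

    if __name__ == "__main__":
        poll_forever()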

1) Have you tried with other RSS feeds?
2) If so, it sounds like some kind of cache... Are you behind some proxy?

Related

Data analysis of log files – How to find a pattern?

My company has slightly more than 300 vehicle-based Windows CE 5.0 mobile devices that all share the same software and usage model of Direct Store Delivery during the day, then doing a Tcom at the home base every night. There is an unknown event (or events) that results in the device freaking out and rebooting itself in the middle of the day. Frequency of this issue is ~10 times per week across the fleet of computers that all reboot daily, 6 days a week. The math is 300*6 = 1800 boots per week (at least), and 10/1800 ≈ 0.5%. I realize that number is very low, but it is more than my boss wants to have.
My challenge is to find a way to scan through several thousand logfile.txt files and try to find some sort of pattern. I KNOW there is a pattern here somewhere. I’ve got a couple of ideas of where to start, but I wanted to throw this out to the community and see what suggestions you all might have.
A bit of background on this issue. The application starts a new log file at each boot. In an orderly (control) log file, you see the app start up, do its thing all day, and then start a shutdown process in a somewhat orderly fashion 8-10 hours later. In a problem log file, you see the device start up and then the log ends, without any shutdown sequence at all, in a time less than 8 hours. It then starts a new log file which shares the same date as the logfile1.old that it made in the rename process. The application that we have was home grown by Windows developers who are no longer with the company. Even better, nobody currently knows who has the source.
I’m aware of the various CE tools that can be used to detect memory leaks (DevHealth, retail messages, etc.) and we are investigating that route as well; however, I’m convinced that there is a pattern to be found that I’m just not smart enough to find. There has to be a way to do this using Perl or Python that I’m just not seeing. Here are two ideas I have.
Idea 1 – Look for trends in word usage.
Create an array of every unique word used in the entire log file and output a count of each word. Once I had a count of the words that were being used, I could run some stats on them and look for the non-normal events. Perhaps the word “purple” is being used 500 times in a 1000 line log file ( there might be some math there?) on a control and only 4 times on a 500 line problem log? Perhaps there is a unique word that is only seen in the problem files. Maybe I could get a reverse “word cloud”?
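A rough sketch of Idea 1 in Python, using only the standard library; the control_logs/ and problem_logs/ globs are placeholders for however the two sets of files get separated:

    import glob
    import re
    from collections import Counter

    def word_counts(paths):
        """Count every word-like token across a set of log files."""
        counts = Counter()
        for path in paths:
            with open(path, errors="replace") as f:
                for line in f:
                    counts.update(re.findall(r"[A-Za-z_]+", line.lower()))
        return counts

    control = word_counts(glob.glob("control_logs/*.txt"))   # placeholder paths
    problem = word_counts(glob.glob("problem_logs/*.txt"))

    # Rank words by how much more often they occur (proportionally) in problem logs.
    total_c = sum(control.values()) or 1
    total_p = sum(problem.values()) or 1
    suspects = sorted(problem,
                      key=lambda w: problem[w] / total_p - control[w] / total_c,
                      reverse=True)
    for word in suspects[:25]:
        print(word, problem[word], control[word])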
Idea 2 – Categorize lines into entry types and then look for trends in the sequence of entry types.
The logfiles already have a predictable schema that looks like this: Level|date|time|system|source|message
I’m 99% sure there is a visible pattern here that I just can’t find. All of the logs got turned up to “super duper verbose”, so there is a boatload of fluff (25 log lines per second, 40k lines per file) that makes this even more challenging. If there isn’t a unique word, then there almost has to be a pattern in the sequence of entries. How do I do this?
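And a rough sketch of Idea 2: split on that Level|date|time|system|source|message schema and compare which entry types show up in the tail of problem files versus control files (the tail size and globs are arbitrary placeholders):

    import glob
    from collections import Counter

    def tail_types(path, n=200):
        """Counter of (level, system, source) triples from the last n lines of a log."""
        with open(path, errors="replace") as f:
            lines = f.readlines()[-n:]
        types = Counter()
        for line in lines:
            parts = line.rstrip("\n").split("|", 5)
            if len(parts) == 6:
                level, _date, _time, system, source, _message = parts
                types[(level, system, source)] += 1
        return types

    problem = sum((tail_types(p) for p in glob.glob("problem_logs/*.txt")), Counter())
    control = sum((tail_types(p) for p in glob.glob("control_logs/*.txt")), Counter())

    # Entry types that dominate the end of problem logs but not control logs.
    for entry_type, count in (problem - control).most_common(15):
        print(entry_type, count)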
Item 3 – Hire a Windows CE platform developer
Yes, we are going down that path as well, but I KNOW there is a pattern I’m missing. They will use the tools that I don’t have (or make the tools that we need) to figure out what’s up. I suspect that there might be a memory leak, radio event or other event that I’m sure the platform tools will show.
Item 4 – Something I’m not even thinking of that you have used.
There have got to be tools out there that do this that aren’t as prestigious as a well-executed python script, and I’m willing to go down that path, I just don’t know what those tools are.
Oh yeah, I can’t post log files to the web, so don’t ask. The users are promising to report trends when they see them, but I’m not exactly hopeful on that front. All I need to find is either a pattern in the logs, or steps to duplicate the problem.
So there you have it. What tools or techniques can I use to even start on this?
Was wondering if you'd looked at the ELK stack? It's an acronym for Elasticsearch, Logstash and Kibana, and it fits your use case closely; it's often used for analysis of large numbers of log files.
Elasticsearch and Kibana give you a UI that lets you interactively explore and chart data for trends. Very powerful and quite straightforward to set up on a Linux platform, and there's a Windows version too. (Took me a day or two of setup, but you get a lot of functional power from it.) The software is free to download and use. You could use this in a style similar to ideas 1 and 2.
https://www.elastic.co/webinars/introduction-elk-stack
http://logz.io/learn/complete-guide-elk-stack/
On the question of Python / item 4 (which ELK could be considered part of): I haven't done this for log files, but I have used regexes to search for and extract text patterns from documents using Python. That may also help you find patterns if you have some leads on the sorts of patterns you are looking for.
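A tiny example of that regex approach; the pattern here is only a placeholder, so substitute whatever tokens your messages actually contain:

    import re

    # Placeholder pattern: tune it to the strings that actually appear in your logs.
    SUSPECT = re.compile(r"(exception|fail\w*|watchdog|0x[0-9A-Fa-f]{8})", re.IGNORECASE)

    def grep_log(path):
        """Yield (line number, matched token, full line) for every suspicious line."""
        with open(path, errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                match = SUSPECT.search(line)
                if match:
                    yield lineno, match.group(1), line.strip()

    for lineno, token, line in grep_log("logfile1.old"):   # placeholder filename
        print(lineno, token, line)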
Just a couple of thoughts; hope they help.
There is no input data at all to this problem so this answer will be basically pure theory, a little collection of ideas you could consider.
To analyze patterns out of a bunch of logs you could definitely create some graphs displaying relevant data, which could help to narrow the problem; Python is really very good for this kind of task.
You could also transform/insert the logs into databases; that way you'd be able to query the relevant suspicious events much faster and even compare all your logs at scale.
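A sketch of that database idea using the stdlib sqlite3 module and the Level|date|time|system|source|message schema from the question; the logs/*.txt glob and the example query are placeholders:

    import glob
    import sqlite3

    conn = sqlite3.connect("logs.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS entries
                    (file TEXT, level TEXT, date TEXT, time TEXT,
                     system TEXT, source TEXT, message TEXT)""")

    for path in glob.glob("logs/*.txt"):               # placeholder glob
        with open(path, errors="replace") as f:
            rows = []
            for line in f:
                parts = line.rstrip("\n").split("|", 5)
                if len(parts) == 6:
                    rows.append([path] + parts)
            conn.executemany("INSERT INTO entries VALUES (?,?,?,?,?,?,?)", rows)
    conn.commit()

    # Example query: which sources log the most lines overall?
    for source, count in conn.execute(
            "SELECT source, COUNT(*) FROM entries GROUP BY source ORDER BY 2 DESC LIMIT 10"):
        print(source, count)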
A simpler approach could be just focusing on a single log showing the crash: instead of wasting a lot of effort or resources trying to find some kind of generic pattern, start by reading through one log in order to catch suspicious "events" which could produce the crash.
My favourite approach for this type of tricky problem is different from the previous ones: instead of focusing on analyzing or even parsing the logs, I'd just try to reproduce the bug(s) in a deterministic way locally (you don't even need to have the source code). Sometimes it's really difficult to replicate the production environment in your dev environment, but it is definitely time well invested. All the effort you put into this process will help you not only to solve these bugs but to improve your software much faster. Remember, the more times you're able to iterate, the better.
Another approach could be coding a little script which would allow you to replay the logs that crashed; not sure whether that'll be easy in your environment, though. Usually this strategy works quite well with production software using web services, where there are lots of request/response pairs to replay.
In any case, without seeing the type of data from your logs I can't be more specific nor giving much more concrete details.

How to properly unit test a web app?

I'm teaching myself backend and frontend web development (I'm using Flask, if it matters) and I need a few pointers when it comes to unit testing my app.
I am mostly concerned with these different cases:
The internal consistency of the data: that's the easy one - I'm aiming for 100% coverage when it comes to issues like the login procedure and, most generally, checking that everything that happens between the Python code and the database after every request remains consistent.
The JSON responses: what I'm doing atm is performing a test request for every GET/POST call on my app and then asserting that the JSON response is this-and-that, but honestly I don't quite appreciate the value in doing this - maybe because my app is still at an early stage?
Should I keep testing every json response for every request?
If yes, what are the long-term benefits?
External APIs: I read conflicting opinions here. Say I'm using an external API to translate some text:
Should I test only the very high level API, i.e. see if I get the access token and that's it?
Should I test that the returned json is what I expect?
Should I test nothing, to speed up my test suite and not make it dependent on a third-party API?
The outputted HTML: I'm lost on this one as well. Say I'm testing the function add_post():
Should I test that on the page that follows the request the desired post is actually there?
I started checking for the presence of strings/html tags in the raw response.data, but then I kind of gave up because 1) it takes a lot of time and 2) I would have to constantly rewrite the tests since I'm changing the app so often.
What is the recommended approach in this case?
Thank you and sorry for the verbosity. I hope I made myself clear!
Most of this is personal opinion and will vary from developer to developer.
There are a ton of python libraries for unit testing - that's a decision best left to you as the developer of the project to find one that fits best with your tool set / build process.
This isn't exactly 'unit testing' per se; I'd consider it more like integration testing. That's not to say this isn't valuable, it's just a different task and will often use different tools. For something like this, testing will pay off in the long run because you'll have peace of mind that your bug fixes and feature additions aren't impacting your end-to-end code. If you're already doing it, I would continue. These sorts of tests are highly valuable when refactoring down the road to ensure consistent functionality.
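For reference, a minimal sketch of that kind of request-level test using Flask's built-in test client and the stdlib unittest module; the myapp import and the /api/posts route are hypothetical stand-ins for your own app:

    import json
    import unittest

    from myapp import app          # hypothetical: however you expose your Flask app

    class PostApiTests(unittest.TestCase):
        def setUp(self):
            app.config["TESTING"] = True
            self.client = app.test_client()

        def test_list_posts_returns_json_list(self):
            response = self.client.get("/api/posts")      # hypothetical route
            self.assertEqual(response.status_code, 200)
            data = json.loads(response.data.decode("utf-8"))
            self.assertIsInstance(data, list)

    if __name__ == "__main__":
        unittest.main()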
I would not waste time testing 3rd party APIs. It's their job to make sure their product behaves reliably. You'll be there all day if you start testing 3rd party features. A big reason to use 3rd party APIs is so you don't have to test them. If you ever discover that your app is breaking because of a 3rd party API it's probably time to pick a different API. If your project scales to a size where you're losing thousands of dollars every time that API fails you have a whole new ball of issues to deal with (and hopefully the resources to address them) at that time.
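If you still want your own code paths exercised without ever hitting the third-party service, one common option (not something the answer above prescribes) is to patch your wrapper around it with unittest.mock; every name in this sketch is hypothetical:

    import unittest
    from unittest import mock      # the standalone `mock` package on older Pythons

    from myapp.views import render_translated_post   # hypothetical view helper

    class TranslationTests(unittest.TestCase):
        @mock.patch("myapp.translation.translate_text", return_value="hola")
        def test_view_does_not_call_the_real_api(self, fake_translate):
            html = render_translated_post("hello", target_language="es")
            fake_translate.assert_called_once_with("hello", target_language="es")
            self.assertIn("hola", html)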
In general, I don't test static content or HTML. There are tools out there (web scraping tools) that will let you trawl your own website for consistent functionality. I would personally leave this as a last priority for the final stages of refinement if you have time. The look and feel of most websites change so often that writing tests isn't worth it. Look and feel is also really easy to test manually because it's so visual.

What are some of the Artificial Intelligence (AI) related techniques one would use for parsing a webpage?

I would like to scrape several different discussions forums, most of which have different HTML formats. Rather than dissecting the HTML for each page, it would be more efficient (and fun) to implement some sort of Learning Algorithm that could identify the different messages (i.e. structures) on each page, and individually parse them while simultaneously ignoring all the extraneous crap (i.e., ads and other nonsense). Could someone please point me to some references or sample code for work that's already been carried out in this area.
Moreover, does anyone know of pseudocode for Arc90's readability code?
http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
build a solution that:
takes some sample webpages with the same structure (eg forum threads)
analyzes the DOM tree of each to find which parts are the same / different
where they are different is the dynamic content you are after (posts, user names, etc)
This technique is known as wrapper induction; the sketch below shows the DOM-comparison idea.
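A rough sketch of that comparison step, assuming the pages share a template and using the third-party lxml library; real wrapper-induction systems are considerably more robust than this:

    from lxml import html   # third-party: pip install lxml

    def differing_text(page_a, page_b):
        """Walk two same-template pages in parallel and report text that differs."""
        tree_a = html.fromstring(page_a)
        tree_b = html.fromstring(page_b)
        diffs = []

        def walk(a, b, path=""):
            here = "%s/%s" % (path, a.tag)
            text_a = (a.text or "").strip()
            text_b = (b.text or "").strip()
            if text_a != text_b:
                diffs.append((here, text_a, text_b))   # likely dynamic content
            for child_a, child_b in zip(a, b):
                walk(child_a, child_b, here)

        walk(tree_a, tree_b)
        return diffs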
There seems to be a Python port of arc90's Readability script that might point you in the right direction (or at least some direction).
Maybe not exactly what you're after, but there's an O'Reilly book called 'Programming Collective Intelligence' that may lead you in the right direction for what you are attempting to do. Additionally, many of the examples are in Python :)

Data recognition, parsing, filtering, and transformation -- GUI?

Looking for a non-cloud based open source app for doing data transformation; though for a killer (and I mean killer) app just built for data transformations, I might be willing to spend up to $1000.
I've looked at Perl, Kapow Katalyst, Pentaho Kettle, and more.
Perl, Python, and Ruby are clearly languages, but I've been unable to find any frameworks/DSLs just for processing data; meaning they're really not great development environments for this: there are no built-in GUIs for building regexes or input/output (CSV, XML, JDBC, REST, etc.), and no debugger for testing rows and rows of data. They're not bad either, just not what I'm looking for, which is a GUI built for complex data transformations; that said, I'd love it if the GUI/app file were in a scripting language, and NOT just stored in some non-human-readable XML/ASCII file.
Kapow Katalyst is made for accessing data via HTTP (HTML, CSS, RSS, JavaScript, etc.); it's got a nice GUI for transforming unstructured text, but that's not its core value offering, and it is way, way too expensive. It does an okay job of traversing document namespace paths; guessing it's just XPath on the back-end, since the syntax appears to be the same.
Pentaho Kettle has a nice GUI for input/output of most common data stores, and its own take on handling data processing, which is okay and has just a small learning curve. Kettle's debugger is OK, in that the data is easy to see, but the errors and exceptions are not threaded with the output, and there's no way to really debug an issue; meaning you can't reload the output/error/exception, although you are able to view the system feedback. All that said, Kettle data transformation is _______ well, let's just say it left me feeling like I must be missing something, because I was completely puzzled by "if it's not possible, just write the transformation in JavaScript"; umm, what?
So, any suggestions? Do realize that I haven't really spec'd out any transformations, but figure if you really use a product for data munging, I'd like to know about it; even excel, I guess.
In general though, currently I'm looking for a product that's able to handle 1,000-100,000 rows with 10-100 columns. It'd be super cool if it could profile data sets, which is a feature Kettle sort of does, but not super well. I'd also like built-in unit testing, meaning I'm able to build out control sets of data and run changes against the control set. Then I'd like to be able to selectively filter out rows and columns as I build out the transformation without altering the build; for example, I run a data set through a transformation, filter the results, and on the next run those sets are automatically blocked at the first "logical" occurrence, which in turn would mean less data to "look at" and a reduced runtime per each enhanced iteration. What would be crazy nice is if, as I'm filtering out the rows/columns, the app is tracking those (and the output was filtered out), and unit tested/highlighted any changes. If I made a change that would affect the application logs and its ability to track the unit tests based on me "breaking a branch", it'd give me a warning, let me dump the data-stored branch, and/or track the primary keys for differences in the next generation of output, or even attempt to match them using fuzzy logic. And yes, I know this is a pipe dream, but hey, figured I'd ask, just in case there's something out there I've just never seen.
Feel free to comment, I'd be happy to answer any questions, or offer additional info.
Google Refine?
Talend will need more than 5 minutes of your time, perhaps closer to about 1 hour, to begin to wire up a basic transformation and to fulfill your requirement to keep transformations under version control as well. You described a pipeline process that can be done easily in Talend when you know how, where you have multiple inputs and outputs in a project as the same raw data goes through various transformations and filtering until it arrives as the final output you desire. Then you can schedule your jobs to repeat the process over similar data. Go back and spend more time with Talend, and you'll succeed in what you need, I'm sure.
I also happen to be one of the committers of Google Refine and also use Talend in my daily work. I actually sometimes model my transformations for Talend first in Google Refine. (Sometimes even using Refine to perform cleanup on borked ETL transforms themselves! LOL ) I can tell you that my experience with Talend played a small part in a few of the features of Google Refine. For instance, both Talend and Google Refine have the concept of an expression editor for your transformations (Talend goes down to Java language for this if need be).
Google Refine will never be an ETL tool, in the sense that we have not designed it to compete in that space where ETL is typically used for large data warehouse backend processing & transformations. However, we designed Google Refine to complement existing ETL tools like Talend by allowing easy live previewing to make informed decisions about your transformations and cleanup, and if your data isn't incredibly huge, then you might opt to perform what you need within Refine itself.
I'm not sure exactly what kind of data or exactly what kind of transformations you're trying to do, but if it's primarily mathematical transformation, perhaps you can try FreeMat, Octave, or SciLab. If it's more data-warehouse-style munging, try open source ETL tools like Clover, Talend, JasperETL Community Edition, or Jitterbit.

What's the best way to implement web service for ajax autocomplete

I'm implementing a "Google Suggest" like autocomplete feature for tag searching using jQuery's autocomplete.
I need to provide a web service to jQuery giving it a list of suggestions based on what the user has typed. I see 2 ways of implementing the web service:
1) just store all the tags in a database and search the DB using user input as prefix. This is simple, but I'm concerned about latency.
2) Use an in-process trie to store all the tags and search it for matching results. As everything will be in-process, I expect this to have much lower latency. But there are several difficulties:
-What's a good way to initialize the trie on process start-up? Presumably I'll store the tag data in a DB, retrieve it and turn it into a trie when I first start up the process. But I'm not sure how. I'm using Python/Django.
-When a new tag is created by a user, I need to insert the new tag into the trie. But let's say I have 5 Django processes and hence 5 tries, how do I tell the other 4 tries that they need to insert a new tag too?
-How do I make sure the trie is threadsafe, as my Django processes will be threaded (I'm using mod_wsgi)? Or do I not have to worry about thread safety because of Python's GIL?
-Any way I can store the tag's frequency of use within the trie as well? How do I tell where the tag's string ends and where the frequency starts - e.g. if I store apple213 into the trie, is it "apple" with frequency 213 or is it "apple2" with frequency 13?
Any help on the issues above or any suggestions on a different approach would be really appreciated.
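On the last point, the ambiguity goes away if the frequency is stored as a field on the terminal node rather than being appended to the key; a minimal dict-based sketch, purely illustrative:

    class TrieNode(object):
        __slots__ = ("children", "count")

        def __init__(self):
            self.children = {}
            self.count = 0          # 0 means "no tag ends here"

    class Trie(object):
        def __init__(self):
            self.root = TrieNode()

        def insert(self, tag, count=1):
            node = self.root
            for ch in tag:
                node = node.children.setdefault(ch, TrieNode())
            node.count += count     # frequency lives on the node, not in the key

        def suggestions(self, prefix):
            node = self.root
            for ch in prefix:
                if ch not in node.children:
                    return []
                node = node.children[ch]
            out = []

            def collect(n, acc):
                if n.count:
                    out.append((acc, n.count))
                for ch, child in n.children.items():
                    collect(child, acc + ch)

            collect(node, prefix)
            return sorted(out, key=lambda pair: -pair[1])

    # trie.insert("apple", 213) is then unambiguous: "apple" with frequency 213.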
Don't be concerned about latency before you measure things -- make up a bunch of pseudo-tags, stick them in the DB, and measure latencies for typical queries. Depending on your DB setup, your latency may be just fine and you're spared wasted worries.
Do always worry about threading, though - the GIL doesn't make race conditions go away (control might switch among threads at any bytecode instruction boundary, as well as when C code in an underlying extension or builtin is executing). You need first to check the threadsafety attribute of the DB API module you're using (see PEP 249), and then use locking appropriately or spawn a small pool of dedicated threads that perform DB interactions (receiving requests on a Queue.Queue and returning results on another, the normal architecture for sound and easy threading in Python).
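A minimal sketch of that dedicated-thread architecture: requests go in on one queue and results come back on another, so only the worker thread ever touches the DB connection. Table and column names are made up, and a real app would give each caller its own reply queue:

    import sqlite3
    import threading
    import queue                     # Queue.Queue on Python 2

    requests = queue.Queue()
    results = queue.Queue()

    def db_worker(db_path="tags.db"):
        # The connection is created and used only inside this thread.
        conn = sqlite3.connect(db_path)    # assumes a tags(name) table already exists
        while True:
            prefix = requests.get()
            if prefix is None:             # sentinel: shut the worker down
                break
            rows = conn.execute(
                "SELECT name FROM tags WHERE name LIKE ? ORDER BY name LIMIT 10",
                (prefix + "%",),
            ).fetchall()
            results.put([name for (name,) in rows])

    threading.Thread(target=db_worker, daemon=True).start()
    requests.put("app")
    print(results.get())             # e.g. ['apple', 'application', ...]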
I would use the first option. 'KISS' - (Keep It Simple Stupid).
For small amounts of data there shouldn't be much latency. We run the same kind of thing for a name search and results appear pretty quickly on a few thousand rows.
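In Django terms that can be as small as a prefix lookup capped at a handful of rows; Tag here is a hypothetical model with a name field, and the term parameter is what jQuery UI's autocomplete sends by default (adjust to your plugin):

    import json

    from django.http import HttpResponse
    from myapp.models import Tag          # hypothetical model with a `name` field

    def autocomplete(request):
        prefix = request.GET.get("term", "")
        names = list(
            Tag.objects.filter(name__istartswith=prefix)
                       .order_by("name")
                       .values_list("name", flat=True)[:10]
        )
        return HttpResponse(json.dumps(names), content_type="application/json")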
Hope that helps,
Josh
