Automated Class timetable optimize crawler? - python

Overall Plan
Get my class information to automatically optimize and select my uni class timetable
Overall Algorithm
Logon to the website using its
Enterprise Sign On Engine login
Find my current semester and its
related subjects (pre setup)
Navigate to the right page and get the data from each related
subject (lecture, practical and
workshop times)
Strip the data of useless
information
Rank the classes which are closer
to each other higher, the ones on
random days lower
Solve a best time table solution
Output me a detailed list of the
BEST CASE information
Output me a detailed list of the
possible class information (some
might be full for example)
Get the program to select the best
classes automatically
Keep checking to see if we can
achieve 7.
6 in detail
Get all the classes, using the lectures as a focus point, would be highest ranked (only one per subject), and try to arrange the classes around that.
Questions
Can anyone supply me with links to something that might be similar to this hopefully written in python?
In regards to 6.: what data structure would you recommend to store this information in? A linked list where each object of uniclass?
Should i write all information to a text file?
I am thinking uniclass to be setup like the following
attributes:
Subject
Rank
Time
Type
Teacher
I am hardly experienced in Python and thought this would be a good learning project to try to accomplish.
Thanks for any help and links provided to help get me started, open to edits to tag appropriately or what ever is necessary (not sure what this falls under other than programming and python?)
EDIT: can't really get the proper formatting i want for this SO post ><

Depending on how far you plan on taking #6, and how big the dataset is, it may be non-trivial; it certainly smacks of NP-hard global optimisation to me...
Still, if you're talking about tens (rather than hundreds) of nodes, a fairly dumb algorithm should give good enough performance.
So, you have two constraints:
A total ordering on the classes by score;
this is flexible.
Class clashes; this is not flexible.
What I mean by flexible is that you can go to more spaced out classes (with lower scores), but you cannot be in two classes at once. Interestingly, there's likely to be a positive correlation between score and clashes; higher scoring classes are more likely to clash.
My first pass at an algorithm:
selected_classes = []
classes = sorted(classes, key=lambda c: c.score)
for clas in classes:
if not clas.clashes_with(selected_classes):
selected_classes.append(clas)
Working out clashes might be awkward if classes are of uneven lengths, start at strange times and so on. Mapping start and end times into a simplified representation of "blocks" of time (every 15 minutes / 30 minutes or whatever you need) would make it easier to look for overlaps between the start and end of different classes.

BeautifulSoup was mentioned here a few times, e.g get-list-of-xml-attribute-values-in-python.
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:
Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text."
Valuable data that was once locked up in poorly-designed websites is now within your reach. Projects that would have taken hours take only minutes with Beautiful Soup.

There are waaay too many questions here.
Please break this down into subject areas and ask specific questions on each subject. Please focus on one of these with specific questions. Please define your terms: "best" doesn't mean anything without some specific measurement to optimize.
Here's what I think I see in your list of topics.
Scraping HTML
1 Logon to the website using its Enterprise Sign On Engine login
2 Find my current semester and its related subjects (pre setup)
3 Navigate to the right page and get the data from each related subject (lecture, practical and workshop times)
4 Strip the data of useless information
Some algorithm to "rank" based on "closer to each other" looking for a "best time". Since these terms are undefined, it's nearly impossible to provide any help on this.
5 Rank the classes which are closer to each other higher, the ones on random days lower
6 Solve a best time table solution
Output something.
7 Output me a detailed list of the BEST CASE information
8 Output me a detailed list of the possible class information (some might be full for example)
Optimize something, looking for "best". Another undefinable term.
9 Get the program to select the best classes automatically
10 Keep checking to see if we can achieve 7.
BTW, Python has "lists". Whether or not they're "linked" doesn't really enter into it.

Related

Scraping web sites with computer vision

I have been given the task to scrape a high number of websites. All of them represent (visually speaking) the data I'm interested in in a similar way. Each one of those websites has a product-details-view (so to call it). And all of the views contain the same information: a product title, price, maybe some images, a description, etc...
If I had to scrape 10 sites, I'd write 10 if/else or case in order to handle them, but I'm afraid the number of websites is quite bigger. And thus I'm getting into a whole other problem.
Then I figured out I'd use "computer vision" and "machine learning". That sounds reasonable in the sense of having almost identical websites and "teaching" an algorithm how to "see" the data I'm interested in.
My strategy, so far, is to render each product-detail-view in a headless chrome (controlled with selenium), take a screenshot and split the visual representation of the website into chunks: left column, main, right column. Then split the "main" part into several chunks: title, breadcrumb, content, etc...
Unfortunately I'm not really sure how to actually split the screenshot into chunks. I have been looking at OpenCV's docs, but I'm not sure it's suited for that concrete purpose (or is it?).
Are there any other libraries that would be a better fit for what I'm trying to do? Also, does my strategy sound good or are there better ways of approaching this problem?
PS: Diffbot, Import.io and similar are not an option. Please don't suggest them.
You can try to solve the problem more engineering approach instead of machine learning. I mean to have one code for all the websites, but different configs for each of them. Some example of the config:
title: '#title_id',
description: '#description_id',
price: '#price_id'
Such approach will need some support in the future, because markup can be changed. But can be good to start for now.

What is the most efficient way to extract visible data from a poker room and how does one implement this?

So I'm new to python and just finished my first application. (Giving random chords to be played on a midi piano and increasing the score if the right notes are hit in a graphical interface, nothing too fancy but also non-trivial.) And now I'm looking for a new challenge, this time I'm going to try and create a program that monitors a poker table and collects data on all the players. Though this is completely allowed on almost all poker rooms (example of the largest one) there is obviously no set and go API available. This probably makes the extraction of relevant data the most challenging part of the entire program. In my search for more information, I came across an undergraduate thesis that goes in to writing such a program using Java (Internet Poker: Data Collection and Analysis - Haruyoshi Sakai).
In this thesis, the author speaks of 3 data collection methods:
Sniffing packets
Hand history
Scraping the screen
Like the author, I've come to the conclusion that the third option is probably the best route, but unlike him I have no knowledge of how to start this.
What I do know is the following: Any table will look like the image below. Note how text, including numbers is written in the same font on the table. Additionally, all relevant information is also supplied in the chat box situated in the lower left corner of the window.
In some regards using the chat box sounds like the best way to go, seeing as all text is predictable and in the same font. The problem I see is computational speed: It will often occur that many actions get executed in rapid succession. Any program will have to be able to keep up with this.
On the other hand, using the table as reference means that you have to deal with unpredictable bet locations.
The plan: Taking this in to a count, I'd start by getting an index of all player's names and stacks from the table view and "initialising" the table that way, and continue to use their stacks to extrapolate the betting they do.
The Method: Of course, the method is the entire reason why I made this post. It seems to me like one would need some sort of OCR to achieve all this, but seeing as everything is in a known font, there may be some significant optimisations that can be made. I would love some input on resources to learn about solutions to similar problems. Or if you've got a better idea on how to tackle this problem, I'd love to hear that too!
Please do be sure to ask any questions you may have, I will be happy to answer them in as much detail as possible.

Building an HTML Diff/Patch Algorithm

A description of what I'm going to accomplish:
Input 2 (N is not essential) HTML documents.
Standardize the HTML format
Diff the two documents -- external styles are not important but anything inline to the document will be included.
Determine delta at the HTML Block Element level.
Expanding the last point:
Imagine two pages of the same site that both share a sidebar with what was probably a common ancestor that has been copy/pasted. Each page has some minor changes to the sidebar. The diff will reveal these changes, then I can "walk up" the DOM to find the first common block element shared by them, or just default to <body>. In this case, I'd like to walk it up and find that, oh, they share a common <div id="sidebar">.
I'm familiar with DaisyDiff and the application is similar -- in the CMS world.
I've also begun playing with the google diff-patch library.
I wanted to give ask this kind of non-specific question to hopefully solicit any advise or guidance that anybody thinks could be helpful. Currently if you put a gun to my head and said "CODE IT" I'd rewrite DaisyDiff in Python and add-in this block-level logic. But I thought maybe there's a better way and the answers to Anyone have a diff algorithm for rendered HTML? made me feel warm and fuzzy.
If you were going to start from scratch, a useful search term would be "tree diff".
There's a pretty awesome blog post here, although I just found it by googling "daisydiff python" so I bet you've already seen it. Besides all the interesting theoretical stuff, he mentions the existence of Logilab's xmldiff, an open-source XML differ written in Python. That might be a decent starting point — maybe less correct than trying to wrap or reimplement DaisyDiff, but probably easier to get up and running quickly.
There's also html-tree-diff on pypi, which I found via this Quora link: http://www.quora.com/Is-there-any-good-Python-implementation-of-a-tree-diff-algorithm
There's some theoretical stuff about tree diffing at efficient diff algorithm for trees and Levenshtein distance on cstheory.stackexchange.
BTW, just to clarify, you are talking about diffing two DOM trees, but not necessarily rendering the diff/merge back into any particular HTML, right? (EDIT: Right.) A lot of the similarly-worded questions on here are really asking "how can I color deleted lines red and added lines green" or "how can I make matching paragraphs line up visually", skipping right over the theoretical hard part of "how do I diff two DOM trees in the first place" and the practical hard part of "how do I parse possibly malformed HTML into a DOM tree even before that". :)
I know this questions is related to python but you could take a look 3DM - XML 3-way Merging and Differencing Tool (default implementation in java) but here is the actual paper describing the algorithm used http://www.cs.hut.fi/~ctl/3dm/thesis.pdf, and here is the link to the site.
Drawback to this is that you do have to cleanup the document and be able to pars it as XML.
You could start by using beautifulsoup to parse both documents.
Then you have a choice:
use prettify to render both documents as more or less standardized HTML and diff those.
compare the parse trees.
The latter allows you to e.g. discard elements that only affect the presentation, not the content. The former is probably easier.

How to get related topics from a present wikipedia article?

I am writing a user-app that takes input from the user as the current open wikipedia page. I have written a piece of code that takes this as input to my module and generates a list of keywords related to that particular article using webscraping and natural language processing.
I want to expand the functionality of the app by providing in addition to the keywords that i have identified, a set of related topics that may be of interest to the user. Is there any API that wikipedia provides that will do the trick. If there isn't, Can anybody Point me to what i should be looking into (incase i have to write code from scratch). Also i will appreciate any pointers in identifying any algorithm that will train the machine to identify topic maps. I am not seeking any paper but rather a practical implementation of something basic
so to summarize,
I need a way to find topics related to current article in wikipedia (categories will also do)
I will also appreciate a sample algorithm for training a machine to identify topics that usually are related and clustered.
ps. please be specific because i have researched through a number of obvious possibilities
appreciate it thank you
You can scrape the categories if you want. If you're working with python, you can read the wikitext directly from their API, and use mwlib to parse the article and find the links.
A more interesting but harder to implement approach would be to create clusters of related terms, and given the list of terms extracted from an article, find the closest terms to them.
"See also" is a section often present in Wikipedia pages.
It is structured like the example below, from [[Article (publishing)]]:
==See also==
* [[Article directory]]
* [[Electronic article]]
You should then parse the wikicode (you can take that via dumps or the Mediawiki API, as hinted in the previous answers), and use the articles mentioned.
Another way is to use directly the Wikipedia categories, there are APIs for that.

What are some of the Artificial Intelligence (AI) related techniques one would use for parsing a webpage?

I would like to scrape several different discussions forums, most of which have different HTML formats. Rather than dissecting the HTML for each page, it would be more efficient (and fun) to implement some sort of Learning Algorithm that could identify the different messages (i.e. structures) on each page, and individually parse them while simultaneously ignoring all the extraneous crap (i.e., ads and other nonsense). Could someone please point me to some references or sample code for work that's already been carried out in this area.
Moreover, does anyone know of pseudocode for Arc90's readability code?
http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
build a solution that:
takes some sample webpages with the same structure (eg forum threads)
analyzes the DOM tree of each to find which parts are the same / different
where they are different is the dynamic content you are after (posts, user names, etc)
This technique is known as wrapper induction.
There seems to be a Python port of arc90's Readability script that might point you in the right direction (or at least some direction).
Maybe not exactly correct but there's an O'Reilly book called 'Collective Intelligence' that may lead you in the right direction for what you are attempting to do. Additionally, many of the examples are in python :)

Categories