parsing chat logs in python, currently using BeautifulSoup - python

I am having some issues parsing an IM chat log using Python 2.7. I am currently using BeautifulSoup.get_text. This generally works, but sometimes masks interesting stuff. For instance:
<font color="#A82F2F"><font size="2">(3/11/2016 3:11:57 PM)</font> <b>user name:</b></font> <html xmlns='http://jabber.org/protocol/xhtml-im'><body xmlns='http://www.w3.org/1999/xhtml'><p>Have you posted the key to https://___.edu/sshkeys/?</p></body></html><br/>
In this case, I get the "Have you posted the key to" part, but it strips out the https:________ part.
Most, but not all, of the lines are formatted the same way, i.e. date, time, user, interesting stuff.
Is there a better way to parse this to get the text AND all the interesting stuff?

You can use find_all:
for anchor in soup.find_all('a', href=True):
    print("The anchor url={} text={}".format(anchor['href'], anchor.get_text()))
Depending on how you want to output this information, you'd have to get more or less clever.
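For example, here is a rough sketch (untested against your full log; it assumes the log lines look like the sample above and that bs4 is installed; "chat.html" is a placeholder file name) that keeps the visible text of each message together with any anchor URLs:
from bs4 import BeautifulSoup

with open("chat.html") as f:                      # placeholder file name
    soup = BeautifulSoup(f.read(), "html.parser")

for p in soup.find_all("p"):                      # each message body in the log
    text = p.get_text(" ", strip=True)            # visible text of the message
    urls = [a["href"] for a in p.find_all("a", href=True)]
    print(u"{} {}".format(text, urls))
You would still need to pair each message with its timestamp/user <font> block, which depends on how regular the log really is.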

Related

How to extract budget, gross, metascore from imdb using scrapy and beautifulsoup?

I am starting with the url below:
http://www.imdb.com/chart/top
The structure of the HTML file seems to be so confusing:
"
Metascore: "
I am trying to use a format like this:
movie['metascore'] = self.get_text(soup.find('h4', attrs={'&nbsp':'Metascore'}))
I'll take a stab at this since it sounds like you're new to scraping. What it sounds like you're actually trying to do is to get the budget, gross, and metascore from each of the individual 250 movie pages on IMDB. You're on the right track by mentioning Scrapy because you do have to crawl to those pages from the initial URL you provided. Scrapy has some excellent documentation, so if you want to use it, I highly recommend you start there first.
However, if all you need is to scrape those 250 pages, you're better off just using Beautiful Soup to do the whole job. Simply do a soup.findAll("td", {"class":"titleColumn"}), extract the links, then do a loop where you have Beautiful Soup open each of those pages one at a time. If you're not sure how to do that, again, BS has excellent documentation.
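A rough sketch of that loop (my illustration only; it assumes requests and bs4 are installed and that the chart page still uses the titleColumn markup):
import requests
from bs4 import BeautifulSoup

top = requests.get("http://www.imdb.com/chart/top")
soup = BeautifulSoup(top.text, "html.parser")

movie_links = ["http://www.imdb.com" + td.a["href"]
               for td in soup.findAll("td", {"class": "titleColumn"})]

for link in movie_links[:5]:                      # limit while experimenting
    movie_page = BeautifulSoup(requests.get(link).text, "html.parser")
    # ...scrape budget, gross, and metascore from movie_page here...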
From there, it's just a matter of scraping the relevant data you want during each iteration. For instance, the metascore of each film is inside a <div> of the class star-box-details. Do a .find for that and then use a regular expression to extract the exact piece you want (regular-expressions.info has a great tutorial on regex, and if you really get into regex, you'll probably end up sinking hours into RexEgg).
I'm not going to code the whole thing since you'll learn a lot through the trial and error that comes with attempting to solve things, but hopefully that puts you on the right track. However, do note that IMDB forbids scraping, but for small projects I'm sure no one will care. But if you want to get serious, the "Does IMDB provide an API?" post has some excellent resources for how to do it via various third-party APIs (and some even directly from IMDB). In your case, the best might be to simply download the data as text files directly from IMDB. Click on any of the FTP links. The files you'll probably want are business.list.gz and ratings.list.gz. As for the metascore on each movie page, that rating actually comes from Metacritic, so you'll want to go there to pull that data.
Good luck!

Reading text in elements using lxml.etree

I am using the Python version of the lxml library. I am currently trying to parse the text from a table, but am encountering a problem in that some of the text is links.
For example, one of the cells may look something like this:
<td>
Can I kick it, <a>to all the people</a> who can quest like a <a>tribe</a> does
</td>
Say after parsing the html, the td element is stored as foo. Then foo.text will not display the whole text, only the parts that aren't links. Moreover, if I find the link text using [i.text for i in foo.getchildren()] I no longer know the order in which to put the non-link text and link text.
Is there an easy way to get around this?
Well, after searching for an hour, I found the solution within two minutes of posting this question.
Use the method foo.text_content() and this will display what is needed.
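For reference, a minimal sketch against the <td> from the question (using lxml.html, where text_content() is available; for plain lxml.etree elements, "".join(foo.itertext()) does the same job):
from lxml import html

doc = html.fromstring(
    "<table><tr><td>Can I kick it, <a>to all the people</a> "
    "who can quest like a <a>tribe</a> does</td></tr></table>")
foo = doc.xpath("//td")[0]

# text_content() concatenates the element's own text and the text of all
# its descendants, in document order
print(foo.text_content())
# Can I kick it, to all the people who can quest like a tribe does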

Processing badly formed HTML files with XPATH

I inherited someone else's (dreadful) codebase, and am currently desperately trying to fix things. Today, that means gathering a list of all the dead links in our template/homepage.
I'm currently using ElementTree in Python, trying to parse the site using xpath. Unfortunately, it seems that the html is malformed, and ElementTree keeps throwing errors.
Are there more error-tolerant XPath parsers? Is there a way to run ElementTree in a non-strict mode? Are there any other methods, such as preprocessing, that can be used to help this process?
LXML can parse some malformed HTML, implements an extended version of the ElementTree API, and supports XPath:
>>> from lxml import html
>>> t = html.fromstring("""<html><body>Hello! <p> Goodbye.</body></html""")
>>> html.tostring(t.xpath("//body")[0])
'<body>Hello! <p> Goodbye.</p></body>'
My commiserations!
You'd be better off parsing your HTML with BeautifulSoup. As the homepage states:
You didn't write that awful page. You're just trying to get some data
out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-turnaround screen
scraping projects.
and more importantly:
Beautiful Soup parses anything you give it, and does the tree
traversal stuff for you. You can tell it "Find all the links", or
"Find all the links of class externalLink", or "Find all the links
whose urls match "foo.com", or "Find the table heading that's got bold
text, then give me that text."
BeautifulSoup can deal with malformed HTML very well. You should also definitely look at How do I fix wrongly nested / unclosed HTML tags?, where Tidy was also suggested.
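As a rough sketch of the dead-link hunt itself (my own illustration, assuming requests and bs4; the file name and the status-code check are placeholders, not a prescription):
import requests
from bs4 import BeautifulSoup

with open("homepage.html") as f:                  # placeholder file name
    soup = BeautifulSoup(f.read(), "html.parser")

links = [a["href"] for a in soup.find_all("a", href=True)]

for url in links:
    try:
        status = requests.head(url, allow_redirects=True, timeout=5).status_code
    except requests.RequestException:
        status = None
    if status is None or status >= 400:
        print("dead link: {}".format(url))
Relative URLs would need to be resolved with urljoin against the page's base URL first.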
This is a bit OT, but since it's the links you are interested in, you could also use an external link checker.
I've used Xenu Link Sleuth for years and it works great. I have a couple of sites with more than 15,000 internal pages, and running Xenu on the LAN with 30 simultaneous threads takes about 5-8 minutes to check a site. All link types (pages, images, CSS, JS, etc.) are checked and there is a simple-but-useful exclusion mechanism. It runs on XP/7 with whatever authorization MSIE has, so you can check member/non-member views of your site.
Note: Do not run it when logged into an account that has admin privileges or it will dutifully wander backstage and start hitting delete on all your data! (Yes, I did that once -- fortunately I had a backup. :-)

How to extract all the URLs from a website?

I am writing a programme in Python to extract all the URLs from a given website: all the URLs from the site, not just from a single page.
Since I suppose I am not the first one who wants to do this, I was wondering whether there is a ready-made solution or if I have to write the code myself.
It's not gonna be easy, but a decent starting point would be to look into these two libraries:
urllib
BeautifulSoup
I didn't see any ready-made scripts that do this on a quick Google search.
Using the scrapy framework makes this almost trivial.
The time-consuming part would be learning how to use Scrapy. Their tutorials are great though and shouldn't take you that long.
http://doc.scrapy.org/en/latest/intro/tutorial.html
Creating a solution that others can use is one of the joys of being part of a programming community. If a scraper doesn't exist, you can create one that everyone can use to get all the links from a site!
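For what it's worth, a bare-bones spider along those lines might look like this (a sketch only, assuming a reasonably recent Scrapy; example.com is a placeholder domain):
import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    allowed_domains = ["example.com"]             # placeholder domain
    start_urls = ["http://example.com/"]

    def parse(self, response):
        for href in response.css("a::attr(href)").extract():
            url = response.urljoin(href)
            yield {"url": url}
            # follow the link so the whole site gets crawled;
            # Scrapy de-duplicates requests by default
            yield scrapy.Request(url, callback=self.parse)
Run it with something like scrapy runspider linkspider.py -o urls.json.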
The given answers are what I would have suggested (+1).
But if you really want to do something quick and simple, and you're on a *NIX platform, try this:
lynx -dump YOUR_URL | grep http
Where YOUR_URL is the URL that you want to check. This should get you all the links you want (except for relative links that are not written out in full).
You first have to download the page's HTML content using a package like urllib or requests.
After that, you can use Beautiful Soup to extract the URLs. In fact, their tutorial shows how to extract all links enclosed in <a> elements as a specific example:
for link in soup.find_all('a'):
print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
If you also want to find links not enclosed in <a> elements, you may have to write something more complex on your own.
EDIT: I also just came across two Scrapy link extractor classes that were created specifically for this task:
http://doc.scrapy.org/en/latest/topics/link-extractors.html
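In recent Scrapy versions the CrawlSpider/LinkExtractor combination makes this very short; roughly (a sketch only, with example.com again standing in for the real site):
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteLinkSpider(CrawlSpider):
    name = "sitelinks"
    allowed_domains = ["example.com"]             # placeholder domain
    start_urls = ["http://example.com/"]

    # follow every extracted link and record each visited URL
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {"url": response.url}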

python method to extract content (excluding navigation) from an HTML page

Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.
Try the Beautiful Soup library for Python. It has very simple methods to extract information from an html file.
Trying to generically extract data from webpages would require people to write their pages in a similar way... but there's an almost infinite number of ways to build a page that looks identical, let alone all the combinations you can use to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' markers and comparing the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed HTML!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text), you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information. Some examples:
+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have a lot of low-scoring rules which add up when you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
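A very rough sketch of that scoring idea (my own illustration with made-up weights, assuming BeautifulSoup is available):
from bs4 import BeautifulSoup

def score(tag):
    words = len(tag.get_text(" ", strip=True).split())
    points = words // 100                              # +1 per 100 words
    points += sum(1 for child in tag.find_all(["div", "p"])
                  if len(child.get_text(" ", strip=True).split()) > 100)
    labels = " ".join([tag.get("id", "")] + tag.get("class", [])).lower()
    if "nav" in labels:                                # penalise navigation
        points -= 1
    if "advert" in labels:                             # penalise adverts
        points -= 2
    return points

def main_content(html):
    soup = BeautifulSoup(html, "html.parser")
    candidates = soup.find_all(["div", "p"])
    return max(candidates, key=score) if candidates else None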
[EDIT2] Looking at readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to try and understand tables better?
Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/
It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).
Here's an example from the google code page:
# Import the Template class.
>>> from templatemaker import Template
# Create a Template instance.
>>> t = Template()
# Learn a Sample String.
>>> t.learn('<b>this and that</b>')
# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'
# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True
# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'
You might use the boilerpipe Web application to fetch and extract content on the fly.
(This is not specific to Python, as you only need to issue an HTTP GET request to a page on Google AppEngine.)
Cheers,
Christian
What is meaningful and what is not depends on the semantics of the page. If the semantics are crappy, your code won't "guess" what is meaningful. I use readability, which you linked in the comment, and I see that on many pages I try to read it does not provide any result at all, let alone a decent one.
If someone puts the content in a table, you're doomed. Try readability on a phpBB forum and you'll see what I mean.
If you want to do it, go with a regexp on <p></p>, or parse the DOM.
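The regexp route can look like this (a quick sketch only; a real DOM parse is more robust against nested or malformed markup):
import re

html = "<div><p>First paragraph.</p><nav>menu</nav><p>Second one.</p></div>"
paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", html, re.DOTALL | re.IGNORECASE)
print(paragraphs)   # ['First paragraph.', 'Second one.']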
Goose is just the library for this task. To quote their README:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags
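A minimal usage sketch, based on python-goose's documented interface (the URL is a placeholder; newer forks ship as goose3 with the same calls):
from goose import Goose

g = Goose()
article = g.extract(url="http://example.com/some-article")   # placeholder URL

print(article.title)
print(article.cleaned_text)       # the main article text
print(article.meta_description)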
