Using Regular Expressions with Twill - Python

I'm currently using urllib2 and BeautifulSoup to open and parse HTML. However, I've run into a problem with a site that uses JavaScript to load its images after the page has been rendered (I'm trying to find the image source for a certain image on the page).
I'm thinking Twill could be a solution, and I am trying to open the page and use a regular expression with 'find' to return the HTML string I'm looking for. I'm having trouble getting this to work, though, and can't seem to find any documentation or examples on how to use regular expressions with Twill.
Any help or advice on how to do this or solve this problem in general would be much appreciated.

I'd rather use CSS selectors or "real" regexes on the page source. As far as I know, Twill is no longer being worked on. Have you tried BeautifulSoup or PyQuery with CSS selectors?

Twill does not work with JavaScript (see http://twill.idyll.org/browsing.html).
Use WebDriver if you want to handle JavaScript.
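A minimal sketch of that route with Selenium; the URL and CSS selector below are placeholders, not taken from the original question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://example.com/gallery")  # placeholder URL
# wait for the JavaScript-inserted image to show up, then read its src attribute
img = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "img.lazy"))  # placeholder selector
)
print(img.get_attribute("src"))
driver.quit()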

Related

Is it possible to use Selenium to fetch page source, then use lxml to scrape data by xpath?

Selenium can be used to navigate a web site (log in, get the HTML source of a page on the site), but nothing in Selenium will then extract data from that HTML by XPath: find_element_by_xpath() will find elements, but not text data outside of tags, and it throws an error when you try, so something else like lxml must be used.
I can't find any examples anywhere on the web of using Selenium to get the HTML source and then passing that to lxml to parse the HTML and pull out data by XPath.
lxml examples are usually given in conjunction with the Python requests library, from which the response is obtained as bytes (response.content). lxml works with that response.content (bytes), but as far as I can tell none of its functions accept the HTML as a string, while Selenium only returns the HTML as a string (self.driver.page_source).
So what should I do here?
I need to use lxml, because it provides XPath capability.
I cannot use Python's requests library to log in to the web site and navigate to a page; it just does not work with this site because of some complexities in how they designed things.
Selenium is the only thing that will log in, create a session, and pass the right cookies on a subsequent GET request.
I need to use Selenium and page_source (a string), but I am not sure how to convert it to the exact bytes that the lxml functions require.
It's proving quite difficult to scrape with Python given the way these libraries fail to work together: Selenium offers no option to produce the HTML as bytes, and lxml (as far as I can tell) won't accept the data as a string.
Any and all help would be appreciated, but I don't believe this can be answered unless you have run into this exact problem and have successfully used Selenium and lxml together.
Try something along these lines and see if it works for you:
import lxml.html

data = self.driver.page_source      # HTML source as a str from Selenium
doc = lxml.html.fromstring(data)    # lxml.html.fromstring() accepts a str directly
target = doc.xpath('some xpath')
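Note that lxml.html.fromstring() accepts page_source as a plain string, so no conversion to bytes is needed, and an XPath ending in text() returns the bare text nodes the question is after. For example (the div id here is a hypothetical illustration):

headline_text = doc.xpath('//div[@id="headline"]//text()')  # hypothetical id; returns a list of strings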

Finding specific names on a website using Python

I need to make an app that uses Python to search for specific names on a website. For instance, I have to check whether the string "Robert Paulson" is being used on a website. If it is, it returns True; otherwise, False. Also, is there any library that can help me build that?
Since you have not attempted to build your application first, I am not going to post code for you. I will, however, suggest using:
urllib2:
A robust module for interacting with web pages, e.g. for pulling back the HTML of a page.
BeautifulSoup (from bs4 import BeautifulSoup):
An awesome module for searching HTML to find whatever it is that you're looking for.
Good luck my friend!
You could do something similar to this other answer. You will just need the regex to find your string.
I have also used the Selenium WebDriver to handle more complex website searching, although I think the link I provided would solve your problem more simply.
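A rough sketch of that idea (the URL is a placeholder; on Python 2 swap urllib.request for urllib2):

import re
from urllib.request import urlopen  # urllib2.urlopen() on Python 2
from bs4 import BeautifulSoup

html = urlopen("http://example.com").read()            # placeholder URL
text = BeautifulSoup(html, "html.parser").get_text()   # strip the markup, keep the visible text
print(bool(re.search(r"Robert Paulson", text)))        # True if the name appears on the page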

Processing badly formed HTML files with XPath

I inherited someone else's (dreadful) codebase, and am currently desperately trying to fix things. Today, that means gathering a list of all the dead links in our template/homepage.
I'm currently using ElementTree in Python, trying to parse the site using XPath. Unfortunately, the HTML seems to be malformed, and ElementTree keeps throwing errors.
Are there more error-friendly XPath parsers? Is there a way to run ElementTree in a non-strict mode? Are there any other methods, such as preprocessing, that could help?
LXML can parse some malformed HTML, implements an extended version of the ElementTree API, and supports XPath:
>>> from lxml import html
>>> t = html.fromstring("""<html><body>Hello! <p> Goodbye.</body></html""")
>>> html.tostring(t.xpath("//body")[0])
'<body>Hello! <p> Goodbye.</p></body>'
My commiserations!
You'd be better off parsing your HTML with BeautifulSoup. As the homepage states:
You didn't write that awful page. You're just trying to get some data
out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-turnaround screen
scraping projects.
and more importantly:
Beautiful Soup parses anything you give it, and does the tree
traversal stuff for you. You can tell it "Find all the links", or
"Find all the links of class externalLink", or "Find all the links
whose urls match "foo.com", or "Find the table heading that's got bold
text, then give me that text."
BeautifulSoup can deal very well with malformed HTML. You should also definitely look at How do I fix wrongly nested / unclosed HTML tags?, where Tidy was also suggested.
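A minimal sketch of gathering every link from a malformed page with BeautifulSoup, so the dead ones can be checked afterwards (the file name is a placeholder):

from bs4 import BeautifulSoup

with open("homepage.html") as f:  # placeholder file name
    soup = BeautifulSoup(f.read(), "html.parser")

# collect every href on the page; check each one for dead links afterwards
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
print(links)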
This is a bit OT, but since it's the links you are interested in, you could also use an external link checker.
I've used Xenu Link Sleuth for years and it works great. I have a couple of sites that have more than 15,000 internal pages and running Xenu on the LAN with 30 simultaneous threads it takes about 5-8 minutes to check the site. All link types (pages, images, CSS, JS, etc.) are checked and there is a simple-but-useful exclusion mechanism. It runs on XP/7 with whatever authorization MSIE has, so you can check member/non-member views of your site.
Note: Do not run it when logged into an account that has admin privileges or it will dutifully wander backstage and start hitting delete on all your data! (Yes, I did that once -- fortunately I had a backup. :-)

What’s the most forgiving HTML parser in Python?

I have some random HTML and I used BeautifulSoup to parse it, but in most cases (>70%) it chokes. I tried Beautiful Soup 3.0.8 and 3.2.0 (there were some problems with 3.1.0 and upwards), but the results are almost the same.
I can recall several HTML parser options available in Python off the top of my head:
BeautifulSoup
lxml
pyquery
I intend to test all of these, but I wanted to know which one came out as the most forgiving in your tests and can even try to parse bad HTML.
They all are. I have yet to come across an HTML page found in the wild that lxml.html couldn't parse. If lxml barfs on the pages you're trying to parse, you can always preprocess them with some regexps to keep lxml happy.
lxml itself is fairly strict, but lxml.html is a different parser and can deal with very broken HTML. For extremely broken HTML, lxml also ships with lxml.html.soupparser, which interfaces with the BeautifulSoup library.
Some approaches to parsing broken html using lxml.html are described here: http://lxml.de/elementsoup.html
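A minimal sketch of that soupparser fallback (the markup below is just an illustrative broken snippet; BeautifulSoup must be installed):

from lxml.html import soupparser

broken = "<p>Unclosed paragraph <b>bold text"
root = soupparser.fromstring(broken)  # BeautifulSoup repairs the tree, lxml API on top
print(root.xpath("//b/text()"))       # expected: ['bold text']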
With pages that don't work with anything else (those that contain nested <form> elements come to mind), I've had success with MinimalSoup and ICantBelieveItsBeautifulSoup. Each can handle certain types of errors that the other can't, so often you'll need to try both.
I ended up using BeautifulSoup 4.0 with html5lib for parsing, and it is much more forgiving. With some modifications to my code it's now working considerably well; thanks all for the suggestions.
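A tiny illustration of that combination (assuming the html5lib package is installed; the snippet is just an example of broken markup):

from bs4 import BeautifulSoup

broken = "<p>Badly <b>nested</i> markup"  # illustrative broken snippet
soup = BeautifulSoup(broken, "html5lib")  # html5lib rebuilds the tree the way a browser would
print(soup.p.get_text())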
If BeautifulSoup doesn't fix your HTML problem, the next best solution would be a regular expression. lxml, ElementTree and minidom are very strict in their parsing, and actually they are right to be.
Other tips:
I feed the HTML to the lynx browser from the command line, take the text version of the page/content, and parse it using regex.
Converting HTML to text or HTML to Markdown strips all the HTML tags, leaving you with plain text, which is easy to parse.
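A rough sketch of that lynx route (assuming the lynx binary is installed and on PATH; the URL and pattern are placeholders):

import re
import subprocess

# `lynx -dump` renders the page and writes a plain-text version to stdout
text = subprocess.check_output(["lynx", "-dump", "http://example.com"]).decode("utf-8", "replace")
print(re.findall(r"some pattern", text))  # placeholder pattern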

How do I parse a website in Python once I know its URL?

If I know the URL of a wiki site, how do I use Python to parse its contents?
This is a very broad question, but the first things to reach for are urllib, which will handle the downloading part, and Beautiful Soup, which will do the parsing. Gluing them together and writing the code to actually extract information from the parse tree is up to you.
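A minimal sketch of gluing the two together (the URL and the bits extracted are placeholders; on Python 2 use urllib2.urlopen()):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://example.com").read()           # placeholder; swap in the wiki page URL you're after
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)                              # the page <title>
print([h.get_text() for h in soup.find_all("h2")])    # e.g. section headings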
You might try Scrapy as well.
