Python pattern matching

I'm currently in the process of converting an old bash script of mine into a Python script with added functionality. I've been able to do most things, but I'm having a lot of trouble with Python pattern matching.
In my previous script, I downloaded a web page and used sed to get the elements I wanted. The matching was done like so (for one of the values I wanted):
PM_NUMBER=`cat um.htm | LANG=sv_SE.iso88591 sed -n 's/.*ol.st.*pm.*count..\([0-9]*\).*/\1/p'`
It would match the number wrapped in <span class="count"></span> after the phrase "olästa pm". The markup I'm running this against is:
<td style="padding-left: 11px;">
<a href="/abuse_list.php">
<img src="/gfx/abuse_unread.png" width="15" height="12" alt="" title="9 anmälningar" />
</a>
</td>
<td align="center">
<a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
<span class="count">3</span>
</td>
<td style="padding-left: 11px;" align="center">
<a class="page_login_text" href="/blogg_latest.php" title="Du har 1 ny bloggkommentar">
<span class="count">1</span>
</td>
<td style="padding-left: 11px;" align="center">
<a class="page_login_text" href="/user_guestbook.php" title="Min gästbok">
<span class="count">1</span>
</td>
<td style="padding-left: 11px;" align="center">
<a class="page_login_text" href="/forum.php?view=3" title="Du har 1 ny forumkommentar">
<span class="count">1</span>
</td>
<td style="padding-left: 11px;" align="center">
<a class="page_login_text" href="/user_images.php?user_id=162005&func=display_new_comments" title="Du har 1 ny albumkommentar">
<span class="count">1</span>
</td>
<td style="padding-left: 11px;" align="center">
<a class="page_login_text" href="/forum_favorites.php" title="Du har 2 uppdaterade trådar i "bevakade trådar"">
<span class="count">2</span>
</td>
I'm hesitant to post this, because it seems like I'm asking for a lot, but could someone please help me with a way to parse this in Python? I've been pulling my hair out trying to do this, but regular expressions and I just don't match (pardon the pun). I've spent the last couple of hours experimenting and reading the Python manual on regular expressions, but I can't seem to figure it out.
Just to make it clear, what I need are 7 different expressions for matching the number within <span class="count"></span>. I need to, for example, be able to find the number of unread PMs ("olästa pm").

Don't parse the HTML yourself. Use an HTML parser written in Python, for example:
Lightweight xml dom parser in python
Beautiful Soup

You can use lxml to pull out the values you are looking for pretty easily with XPath.
lxml
xpath
Example
from lxml import html
page = html.fromstring(open("um.htm", "r").read())
matches = page.xpath("//a[contains(@title, 'pm.') or contains(@title, 'ol')]/span")
print [elem.text for elem in matches]

Use either:
BeautifulSoup
lxml
Parsing HTML with regexes is a recipe for disaster.

It is impossible to reliably match HTML using regular expressions. It is usually possible to cobble something together that works for a specific page, but it is not advisable as even a subtle tweak to the source HTML can render all your work useless. HTML simply has a more complex structure than Regex is capable of describing.
The proper solution is to use a dedicated HTML parser. Note that even XML parsers won't do what you need, not reliably anyway. Valid XHTML is valid XML, but valid HTML is not, even though the two look similar. And valid HTML/XHTML is nearly impossible to find in the wild anyway.
There are a few different HTML parsers available:
BeautifulSoup is not in the standard library, but it is the most forgiving parser: it can handle almost all real-world HTML, and it's designed to do exactly what you're trying to do.
HTMLParser is included in the Python standard library, but it is fairly strict about accepting only valid HTML.
htmllib is also in the standard library, but is deprecated.
As other people have suggested, BeautifulSoup is almost certainly your best choice.
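For the markup in the question, a minimal sketch with BeautifulSoup 4 might look like the following (the um.htm file name comes from the question; keying off the /pm.php href rather than the Swedish title text is an assumption):
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("um.htm").read(), "html.parser")

# The unread-PM count sits in the <span class="count"> inside the link to /pm.php
pm_link = soup.find("a", href="/pm.php")
pm_number = int(pm_link.find("span", class_="count").text)
print(pm_number)  # prints 3 for the sample markup above
The other six counts can be pulled out the same way by matching each link's href.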

Related

Extracting text from html straight into a variable

I'm trying to extract a line of text from an HTML file straight into a variable; however, I have found no solution to the problem despite hours of searching. Beautiful Soup looks helpful, but how would I be able to simply pick out a desired string as an input and then extract it from the HTML source right into a variable?
I've been trying to use request.text and Beautiful Soup to scrape the entire page, but it seems there is no function to do it directly.
from urllib.request import urlopen
from bs4 import BeautifulSoup
def extract(url):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    return [item.text for item in soup.find_all('<DIV ALIGN="justify"')]
<HTML>
<HEAD>
<TITLE>webpage1</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">
<TABLE WIDTH="75%" ALIGN="center">
<TR>
<TD>
<DIV ALIGN="center"><H1>STARTING . . . </H1></DIV>
<DIV ALIGN="justify"><P>There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language - HTML.
<BR>
<P>HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!</P>
When I run it, I would like it to return the string
<P>There are lots of ways to create web pages
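For reference, a corrected sketch of the attempt above (the html.parser choice is an assumption; find_all wants the tag name and attributes rather than a chunk of raw markup, and tag and attribute names are matched case-insensitively):
from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract(url):
    # use the urlopen that was actually imported
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # match <DIV ALIGN="justify"> by tag name and attribute
    return [div.get_text() for div in soup.find_all("div", align="justify")]

# e.g. extract("file:///path/to/webpage1.html") or any http:// URL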

click Menu item with same ID using Selenium Python

Below are my HTML snippet and the code I tried. I need to click the Integrated Consoles menu item. I tried the code below, but nothing happens and there is no error either. Kindly help me select the specific menu item using the text inside the tag.
driver.find_element_by_xpath(".//td[contains text(),'Integrated Consoles']").click()
HTML sample snippet:
<td nowrap="" id="MENU_TD110"> Integrated Consoles </td>
<td nowrap="" id="MENU_TD110"> System Information </td>
<td nowrap="" id="MENU_TD110"> More Tools </td>
Parentheses () are missing inside your contains() method; just enclose it like below and try:
driver.find_element_by_xpath(".//td[contains(text(),'Integrated Consoles')]").click()
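If the leading and trailing spaces inside the <td> text ever cause trouble, normalize-space() is a common alternative (a sketch along the same lines, not tested against the actual page):
# match on the whitespace-trimmed cell text instead of a substring
driver.find_element_by_xpath(".//td[normalize-space(text())='Integrated Consoles']").click()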

How to extract rating information using CSS selector or any other methods

I am learning web scraping on my own and I am trying to scrape reviewers' ratings on Yelp as practice. Typically, I can use CSS selectors or XPath to select the content I am interested in. However, those methods do not work for selecting reviewers' ratings. For instance, on the following page: https://www.yelp.com/user_details_reviews_self?userid=0S6EI51ej5J7dgYz3-O0lA. The CSS selector for the first rating is '.stars_2'. However, if I use this selector in my RSelenium code as follows:
ratings=remDr$findElements('css selector','.stars_2')
ratings=unlist(lapply(ratings, function(x){x$getElementText()}))
I get NULL. I think the reason is that the rating is actually an image. I have pasted a small part of the page source here:
<div class="review-content">
<div class="review-content">
<div class="biz-rating biz-rating-very-large clearfix">
<div>
<div class="rating-very-large">
<i class="star-img stars_2" title="2.0 star rating">
<img alt="2.0 star rating" class="offscreen" height="303" src="//s3-media4.fl.yelpcdn.com/assets/srv0/yelp_styleguide/c2252a4cd43e/assets/img/stars/stars_map.png" width="84">
</i>
</div>
</div>
Basically, if I can extract the text from class="star-img stars_2" or title="2.0 star rating" then I am good. Can anyone help me with this?
You might want to try something like this approach:
Using the Yelp API with R, attempting to search business types using geo-coordinates
Though it seems some folks found this outdated, I found some useful code on the Yelp GitHub page:
https://github.com/Yelp/yelp-api/pull/88
https://github.com/Yelp/yelp-api/pull/88/commits/95009afde2b47e8244fda3d435f0476205cc0039
Good luck!
:)
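Another angle, since the rating only exists in the title attribute of the <i class="star-img stars_2"> element: read that attribute directly instead of the element text. A rough sketch of the idea in Python Selenium (the question uses RSelenium, so treat this only as an illustration of the attribute approach):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.yelp.com/user_details_reviews_self?userid=0S6EI51ej5J7dgYz3-O0lA")

# the rating lives in the title attribute, not in the element text
stars = driver.find_elements_by_css_selector("i.star-img")
ratings = [star.get_attribute("title") for star in stars]
print(ratings)  # e.g. ['2.0 star rating', ...]
In RSelenium the same idea would be reading the attribute with x$getElementAttribute("title") instead of x$getElementText().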

how to generate graphics from python (or printable tables)?

I would like to write a generator of multiplication tables in Python for my children. I imagine something like a 10x10 table with 20 or 30 of the cells randomly bolded (a thicker border). What would be a good method to generate the printable output?
I am tentatively thinking of generating a LaTeX file, but there may be a simpler (more Pythonic, fewer dependencies) solution?
UPDATE: if someone is interested in the code to generate the above, I posted it to bitbucket.org. This is an alpha version from a "Sunday developer" as we say in France (which means that the code is ugly and that you must not use it under any circumstances when developing space shuttle management software :))
You might want to use HTML and CSS instead of LaTeX; it's a little bit simpler and cleaner, and just as printable (a small Python sketch that writes such a table follows the markup below).
<html>
<head>
<style>
table {border-collapse: collapse}
td { border:1px solid black; }
td.bolded { border:3px solid black }
</style>
</head>
<body>
<table>
<tr>
<td> 1 </td> <td> 2 </td> <td class="bolded"> 3 </td>
</tr>
</table>
</body>
</html>
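A small Python sketch along those lines (the output file name and the exact number of bolded cells are arbitrary choices) that writes a multiplication table with randomly thick-bordered cells:
import random

SIZE = 10    # 10x10 multiplication table
N_BOLD = 25  # roughly 20-30 cells get the thicker border

# pick the cells that will get the "bolded" class
all_cells = [(r, c) for r in range(1, SIZE + 1) for c in range(1, SIZE + 1)]
bold = set(random.sample(all_cells, N_BOLD))

rows = []
for r in range(1, SIZE + 1):
    cells = []
    for c in range(1, SIZE + 1):
        cls = ' class="bolded"' if (r, c) in bold else ''
        cells.append('<td%s> %d </td>' % (cls, r * c))
    rows.append('<tr>' + ' '.join(cells) + '</tr>')

page = """<html>
<head>
<style>
table {border-collapse: collapse}
td { border:1px solid black; }
td.bolded { border:3px solid black }
</style>
</head>
<body>
<table>
%s
</table>
</body>
</html>""" % '\n'.join(rows)

open('multiplication_table.html', 'w').write(page)
Open the resulting file in a browser and print it from there.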

Parsing Web Page's Search Results With Python

I recently started working on a program in Python which allows the user to conjugate any verb easily. To do this, I am using the urllib module to open the corresponding conjugation web page. For example, the verb "beber" would have the web page:
"http://www.spanishdict.com/conjugate/beber"
To open the page, I use the following python code:
source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
This source does contain the information that I want to parse. But, when I make a BeautifulSoup object out of it like this:
soup = BeautifulSoup(source)
I appear to lose all the information I want to parse. The information lost when making the BeautifulSoup object usually looks something like this:
<tr>
<td class="verb-pronoun-row">
yo </td>
<td class="">
bebo </td>
<td class="">
bebí </td>
<td class="">
bebía </td>
<td class="">
bebería </td>
<td class="">
beberé </td>
</tr>
What am I doing wrong? I am no professional at Python or Web Parsing in general, so it may be a simple problem.
Here is my complete code (I used the "++++++" to differentiate the two):
import urllib
from bs4 import BeautifulSoup
source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
soup = BeautifulSoup(source)
print source
print "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"
print str(soup)
When I wrote parsers I had problems with bs: in some cases it didn't find things that lxml found, and vice versa, because of broken HTML.
Try to use lxml.html.
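A rough sketch of the lxml.html route, in the same Python 2 urllib style as the question (the verb-pronoun-row class comes from the question's snippet; whether the live page still serves it is an assumption):
import urllib
from lxml import html

source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
tree = html.fromstring(source)

# rows whose first cell carries the verb-pronoun-row class, as in the question's snippet
for row in tree.xpath("//tr[td[@class='verb-pronoun-row']]"):
    print([td.text_content().strip() for td in row.xpath("./td")])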
Your problem may be with encoding. I think that bs4 works with UTF-8 and you have a different encoding set as the default on your machine (an encoding that contains Spanish letters). So urllib fetches the page in your default encoding; that's okay, the data is there in the source and it even prints out fine, but when you pass it to the UTF-8-based bs4, those characters are lost. Try looking into setting a different encoding in bs4 and, if possible, set it to your default. This is just a guess though, take it easy.
I recommend using regular expressions; I have used them for all my web crawlers. Whether this is usable for you depends on how dynamic the website is, but that problem is there even when you use bs4. You just write all your regular expressions manually and let them do the magic. You would have to work with bs4 in a similar way when looking for the information you want.
