Parsing Web Page's Search Results With Python - python

I recently started working on a program in python which allows the user to conjugate any verb easily. To do this, I am using the urllib module to open the corresponding conjugations web page. For example, the verb "beber" would have the web page:
"http://www.spanishdict.com/conjugate/beber"
To open the page, I use the following python code:
source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
This source does contain the information that I want to parse. But, when I make a BeautifulSoup object out of it like this:
soup = BeautifulSoup(source)
I appear to lose all the information I want to parse. The information lost when making the BeautifulSoup object usually looks something like this:
<tr>
<td class="verb-pronoun-row">
yo </td>
<td class="">
bebo </td>
<td class="">
bebí </td>
<td class="">
bebía </td>
<td class="">
bebería </td>
<td class="">
beberé </td>
</tr>
What am I doing wrong? I am no professional at Python or Web Parsing in general, so it may be a simple problem.
Here is my complete code (I used the "++++++" to differentiate the two):
import urllib
from bs4 import BeautifulSoup
source = urllib.urlopen("http://www.spanishdict.com/conjugate/beber").read()
soup = BeautifulSoup(source)
print source
print "+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"
print str(soup)

When I wrote parsers I had problems with BeautifulSoup: in some cases it didn't find elements that lxml found, and vice versa, because of broken HTML.
Try to use lxml.html.

Your problem may be with encoding. I think that bs4 works with UTF-8 and your machine has a different default encoding (one that contains Spanish letters). So urllib fetches the page in your default encoding; that's fine, the data is there in the source and it even prints correctly, but when you pass it to UTF-8-based bs4 those characters are lost. Try looking for a way to set a different encoding in bs4 and, if possible, set it to your default. This is just a guess, though, so take it with a grain of salt.
I recommend using regular expressions. I have used them for all my web crawlers. Whether this is usable for you depends on how dynamic the website is, but that problem exists even when you use bs4. You just write all your regular expressions by hand and let them do the magic. You would have to work with bs4 in a similar way when looking for the information you want.
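One way to test the encoding guess above is to decode the downloaded bytes explicitly before handing them to a parser. The sketch below is only an illustration, assuming the server sends UTF-8; the sample bytes are made up for the example and are not taken from the site.

```python
# Sample bytes standing in for a downloaded page (assumed to be UTF-8).
raw = "yo bebo bebí bebía bebería beberé".encode("utf-8")

# Decoding with the matching codec keeps the accented letters intact.
good = raw.decode("utf-8")

# Decoding with a mismatched codec does not raise, but garbles every accent.
bad = raw.decode("latin-1")

print(good)
print(bad)
```

If the garbled form shows up in your own output, pass the correctly decoded string to BeautifulSoup, or tell it the encoding with its `from_encoding` argument, instead of relying on the default.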

Related

Dynamic content not rendered when scraping using selenium

I am trying to use selenium for scraping (the script used to work in python 3.7).
Last week I had to reset my PC and I installed the latest versions of python and all the packages used in the script.
What I observed was that none of the dynamic values are getting rendered and are displayed with header tags. Please see below some of the outputs:
<tr>
<td class="textsr">Close</td>
<td class="textvalue">{{ScripHeaderData.Header.Close}}</td>
</tr>
<tr>
<td class="textsr">WAP</td>
<td class="textvalue">{{StkTrd.WAP}}</td>
</tr>
<tr>
<td class="textsr">Big Value</td>
<td class="textvalue">{{checknullheader(CompData.BigVal)?'-':(CompData.BigVal)}}</td>
</tr>
I have been using the script for my research purpose and need it back in shape, hence appreciate any guidance.
Here's the snippet for reference:
target_url = q.get(timeout=1)
time.sleep(1)
driver = webdriver.Chrome('./chromedriver',options=opts)
driver.get(target_url)
# this is just to ensure that the page is loaded
time.sleep(5)
html_content = driver.page_source
soup = BeautifulSoup(html_content, features="html.parser")
table_rows = soup.find_all('tr')
for row in table_rows:
    table_cols = row.find_all('td')
    for col in table_cols:
        label_value = col.text
I referred to many forums and tried many suggestions (waits, driver options, changing web drivers, switching content, etc.), but my issue seems to be more specific and none of them resolved it.
Eventually I fell back to my old setup (which runs Python 3.9.6), and the script went back to a working state.
Thanks to you Joe Carboni for your time and inputs on this.
It is a bit frustrating that I could not find the root cause of the issue and a workaround to resolve it. But just posting what I did here in case if it helps someone, cheers.
While it may be tempting to use time.sleep to wait for the page to load, it's better to use Selenium Waits with conditions to wait for, likely related to the elements you want.
https://www.selenium.dev/documentation/webdriver/waits/
Here's another thread with a good answer about Waits and conditions vs. time.sleep: How to sleep Selenium WebDriver in Python for milliseconds
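The unrendered `{{...}}` placeholders in the question suggest a concrete condition to wait for: the page source no longer containing any template markers. Here is a rough sketch of that check using only the standard library. `is_rendered` and `wait_until_rendered` are hypothetical helper names, and the `{{...}}` pattern is an assumption based on the output shown in the question.

```python
import re
import time

# Matches template markers such as {{StkTrd.WAP}} (assumed placeholder syntax).
PLACEHOLDER = re.compile(r"\{\{.*?\}\}")


def is_rendered(html):
    """Return True when no {{...}} placeholder is left in the HTML."""
    return PLACEHOLDER.search(html) is None


def wait_until_rendered(get_html, timeout=10.0, interval=0.5):
    """Poll get_html() until the result has no placeholders, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        html = get_html()
        if is_rendered(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("page still contains unrendered placeholders")
        time.sleep(interval)
```

With Selenium this could be called as `wait_until_rendered(lambda: driver.page_source)`. The same predicate can also be handed to an explicit wait, for example `WebDriverWait(driver, 10).until(lambda d: is_rendered(d.page_source))`.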

Extracting text from html straight into a variable

I'm trying to extract a line of text from a html file straight into a variable, however, I have found no solution to the problem despite hours of searching. Beautiful Soup looks helpful, how would I be able to simply pick out a desired string as an input and then extract it from the html source right into a variable?
I've been trying to use request.text and Beautiful Soup to scrape the entire page, but there seems to be no function that does this directly.
from urllib.request import urlopen
from bs4 import BeautifulSoup
def extract(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html)
    return [item.text for item in soup.find_all('<DIV ALIGN="justify"')]
<HMTL>
<HEAD>
<TITLE>webpage1</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFf" LINK="006666" ALINK="8B4513" VLINK="006666">
<TABLE WIDTH="75%" ALIGN="center">
<TR>
<TD>
<DIV ALIGN="center"><H1>STARTING . . . </H1></DIV>
<DIV ALIGN="justify"><P>There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language - HTML.
<BR>
<P>HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!</P>
When I run I would like for it to return the string
<P>There are lots of ways to create web pages
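For what it's worth, `find_all` expects a tag name and attribute filters rather than a fragment of raw markup, which is probably why the attempt above returns nothing. A minimal sketch under that assumption follows; `first_justified_text` is a hypothetical helper name, and the sample HTML is shortened from the page in the question.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question.
SAMPLE = """<HTML><BODY>
<DIV ALIGN="center"><H1>STARTING . . . </H1></DIV>
<DIV ALIGN="justify"><P>There are lots of ways to create web pages.
<BR>
<P>HTML isn't computer code.</P>
</DIV>
</BODY></HTML>"""


def first_justified_text(html):
    """Return the first piece of text inside <div align="justify">, or None."""
    # html.parser lowercases tag and attribute names, so match in lowercase.
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", attrs={"align": "justify"})
    if div is None:
        return None
    # stripped_strings yields each text fragment with whitespace trimmed.
    return next(div.stripped_strings, None)


text = first_justified_text(SAMPLE)
print(text)
```

The returned value is an ordinary string, so it can be stored in a variable or searched with `in` like any other text.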

Execute Javascript function on website using Python

Is it possible to call a Javascript function on a website that I'm web scraping and saving the result of the function?
I'm using Requests to establish a connection and saving certain pages that I need and BeautifulSoup to make it readable and accessing certain parts.
There is one part that I'm not sure how to call, or even if it's possible:
<tr class=TRDark>
<td width=100% colspan=3>
<a href="" onclick="OpenPayPlan('payplan.asp?app=******');return false;">
Betalingsplan
</a>
</td>
</tr>
This function will open a new window and calculate some data that I need. Is this possible to do with Python?
I cannot use Selenium or similar programs for this. This must be executed in the terminal and only the terminal.
You probably need to find a JavaScript interpreter with Python bindings. Once you've found one that fits your needs, its documentation will show you how it works. One example is PyV8. Python itself does not include a JavaScript interpreter.
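If `OpenPayPlan` does little more than open the given address in a new window (an assumption, since the function body is not shown), a JavaScript interpreter may not be needed: the URL can be pulled out of the `onclick` attribute and requested directly. A sketch, where `payplan_url` is a hypothetical helper, `12345` is a made-up application id, and `http://example.com/loans/overview.asp` is a placeholder for the page being scraped:

```python
import re
from urllib.parse import urljoin

# Captures the quoted argument of OpenPayPlan('...') in an onclick attribute.
ONCLICK_URL = re.compile(r"OpenPayPlan\('([^']+)'\)")


def payplan_url(onclick, base_url):
    """Return the absolute URL passed to OpenPayPlan, or None if absent."""
    match = ONCLICK_URL.search(onclick)
    if match is None:
        return None
    # Resolve the relative address against the page it was found on.
    return urljoin(base_url, match.group(1))


onclick = "OpenPayPlan('payplan.asp?app=12345');return false;"
url = payplan_url(onclick, "http://example.com/loans/overview.asp")
print(url)
```

The resulting URL can then be fetched with the same Requests session used for the other pages, so that any login cookies are sent along. If the page computes its data in JavaScript after loading, this shortcut will not be enough.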

Can't scrape nested html using BeautifulSoup

I am interested in scraping "0.449" from the following source code from http://hdsc.nws.noaa.gov/hdsc/pfds/pfds_map_cont.html?Lat=33.146425&Lon=-87.5805543.
<td class="tblInner" id="0-0">
<div style="font-size:110%">
<b>0.449</b>
</div>
"(0.364-0.545)"
</td>
Using BeautifulSoup, I currently have written:
storm=soup.find("td",{"class":"tblInner","id":"0-0"})
which results in:
<td class="tblInner" id="0-0">-</td>
I am unsure of why everything nested within the td is not showing up. When I search the contents of the td, my result is simply "-". How can I scrape the value that I want from this code?
You are likely scraping a website that uses javascript to update the DOM after the initial load.
You have a couple choices:
Find out where the JavaScript code that fills the HTML page gets its data and call that source instead. The data most likely comes from an API that you can call directly with curl. That's the best method 99% of the time.
Use a headless browser (zombie.js, ...) to retrieve the HTML after the JavaScript has changed it. Convenient and fast, but there are few tools in Python to do this (search for "python headless browser").
Use selenium or splinter to remote-control a real browser (Chrome, Firefox, ...). It's convenient and works in Python, but it is slow as hell.
Edit:
I did not see that you posted the url you wanted to scrape.
In your particular case, the data you want comes from an AJAX call to this URL:
http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py?lat=33.1464&lon=-87.5806&type=pf&data=depth&units=english&series=pds
You now only need to understand what each parameter does, and parse the output of that instead of writing an HTML scraper.
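To experiment with the parameters, it can help to build the query string from a dictionary instead of editing the URL by hand. A small sketch, where the parameter names and values are copied from the URL above and `build_url` is a hypothetical helper:

```python
from urllib.parse import urlencode

BASE = "http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py"


def build_url(lat, lon, series="pds"):
    """Assemble the AJAX URL for one coordinate pair."""
    params = {
        "lat": lat,
        "lon": lon,
        "type": "pf",
        "data": "depth",
        "units": "english",
        "series": series,
    }
    # urlencode escapes the values and joins them with '&'.
    return BASE + "?" + urlencode(params)


url = build_url(33.1464, -87.5806)
print(url)
```

Changing one value at a time and comparing the responses is a simple way to work out what each parameter controls.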
Please excuse the lack of error checking and modularity, but this should get you what you need, based on @Eloims' observation:
import requests
import re
url = 'http://hdsc.nws.noaa.gov/cgi-bin/hdsc/new/cgi_readH5.py?lat=33.1464&lon=-87.5806&type=pf&data=depth&units=english&series=pds'
r = requests.get(url)
response = r.text
coord_list_text = re.search(r'quantiles = (.*);', response)
coord_list = eval(coord_list_text.group(1))
print coord_list[0][0]
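Since `eval` executes whatever the server returns, a more cautious variant is to parse the list with `ast.literal_eval`, which only accepts Python literals. The sketch below assumes the response embeds a list literal after `quantiles = `, as the regex above implies. The `SAMPLE` text is invented for illustration and is not real output from the service.

```python
import ast
import re

# Invented stand-in for the response body (real output may differ).
SAMPLE = (
    "quantiles = [['0.449', '0.520'], ['0.684', '0.793']];\n"
    "upper = [['0.545', '0.631']];"
)


def parse_quantiles(response_text):
    """Extract the 'quantiles' list without executing arbitrary code."""
    match = re.search(r"quantiles = (.*);", response_text)
    if match is None:
        raise ValueError("no 'quantiles' assignment found in response")
    # literal_eval raises ValueError or SyntaxError on anything but literals.
    return ast.literal_eval(match.group(1))


quantiles = parse_quantiles(SAMPLE)
print(quantiles[0][0])
```

With a live request, `parse_quantiles(r.text)` would take the place of the `re.search` and `eval` lines.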

Python pattern matching

I'm currently in the process of converting an old bash script of mine into a Python script with added functionality. I've been able to do most things, but I'm having a lot of trouble with Python pattern matching.
In my previous script, I downloaded a web page and used sed to get the element I wanted. The matching was done like so (for one of the values I wanted):
PM_NUMBER=`cat um.htm | LANG=sv_SE.iso88591 sed -n 's/.*ol.st.*pm.*count..\([0-9]*\).*/\1/p'`
It would match the number wrapped in <span class="count"></span> after the phrase "olästa pm". The markup I'm running this against is:
<td style="padding-left: 11px;">
<a href="/abuse_list.php">
<img src="/gfx/abuse_unread.png" width="15" height="12" alt="" title="9 anmälningar" />
</a>
</td>
<td align="center">
<a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
<span class="count">3</span>
</td>
<td style="padding-left: 11px;" align="center">
<a class="page_login_text" href="/blogg_latest.php" title="Du har 1 ny bloggkommentar">
<span class="count">1</span>
</td>
<td style="padding-left: 11px;" align="center">
<a class="page_login_text" href="/user_guestbook.php" title="Min gästbok">
<span class="count">1</span>
</td>
<td style="padding-left: 11px;" align="center">
<a class="page_login_text" href="/forum.php?view=3" title="Du har 1 ny forumkommentar">
<span class="count">1</span>
</td>
<td style="padding-left: 11px;" align="center">
<a class="page_login_text" href="/user_images.php?user_id=162005&func=display_new_comments" title="Du har 1 ny albumkommentar">
<span class="count">1</span>
</td>
<td style="padding-left: 11px;" align="center">
<a class="page_login_text" href="/forum_favorites.php" title="Du har 2 uppdaterade trådar i "bevakade trådar"">
<span class="count">2</span>
</td>
I'm hesitant to post this, because it seems like I'm asking for a lot, but could someone please help me with a way to parse this in Python? I've been pulling my hair trying to do this, but regular expressions and I just don't match (pardon the pun). I've spent the last couple of hours experimenting and reading the Python manual on regular expressions, but I can't seem to figure it out.
Just to make it clear, what I need are 7 different expressions for matching the number within <span class="count"></span>. I need to, for example, be able to find the number of unread PMs ("olästa pm").
You will not parse html yourself. You will use a html parser built in python to parse the html.
Lightweight xml dom parser in python
Beautiful Soup
You can use lxml to pull out the values you are looking for pretty easily with XPath expressions
lxml
xpath
Example
from lxml import html
page = html.fromstring(open("um.htm", "r").read())
matches = page.xpath("//a[contains(@title, 'pm.') or contains(@title, 'ol')]/span")
print [elem.text for elem in matches]
use either:
BeautifulSoup
lxml
parsing HTML with regexes is a recipe for disaster.
It is impossible to reliably match HTML using regular expressions. It is usually possible to cobble something together that works for a specific page, but it is not advisable as even a subtle tweak to the source HTML can render all your work useless. HTML simply has a more complex structure than Regex is capable of describing.
The proper solution is to use a dedicated HTML parser. Note that even XML parsers won't do what you need, not reliably anyway. Valid XHTML is valid XML, but even valid HTML is not, even though it's quite similar. And valid HTML/XHTML is nearly impossible to find in the wild anyway.
There are a few different HTML parsers available:
BeautifulSoup is not in the standard library, but it is the most forgiving parser, it can handle almost all real-world HTML and it's designed to do exactly what you're trying to do.
HTMLParser is included in the Python standard library, but it is fairly strict about accepting only valid HTML.
htmllib is also in the standard library, but is deprecated.
As other people have suggested, BeautifulSoup is almost certainly your best choice.
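For comparison, here is a sketch of the same extraction with the standard library parser mentioned above (`html.parser` in Python 3), which needs no extra installation. `CountParser` is a hypothetical class name, and the sample markup is a shortened copy of the HTML in the question. It assumes each `<span class="count">` follows the link whose `title` describes it.

```python
from html.parser import HTMLParser

SAMPLE = """
<td align="center">
<a class="page_login_text" href="/pm.php" title="Du har 3 olästa pm.">
<span class="count">3</span>
</td>
<td style="padding-left: 11px;" align="center">
<a class="page_login_text" href="/blogg_latest.php" title="Du har 1 ny bloggkommentar">
<span class="count">1</span>
</td>
"""


class CountParser(HTMLParser):
    """Map each link title to the number in the count span that follows it."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self._title = None      # title of the most recent <a>
        self._in_count = False  # True while inside <span class="count">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "title" in attrs:
            self._title = attrs["title"]
        elif tag == "span" and attrs.get("class") == "count":
            self._in_count = True

    def handle_data(self, data):
        text = data.strip()
        if self._in_count and self._title is not None and text.isdigit():
            self.counts[self._title] = int(text)

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_count = False


parser = CountParser()
parser.feed(SAMPLE)
parser.close()

# Pick the entry whose title mentions unread PMs.
unread_pm = next(n for title, n in parser.counts.items() if "olästa pm" in title)
print(unread_pm)
```

Keying the counts by title means the seven values can be looked up by a distinctive phrase instead of seven separate expressions.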
