Scrape a form on an incorrect web page - Python

I'm trying to scrape an HTML form using RoboBrowser with Python 3.4. I use the default HTML parser:
self._browser = RoboBrowser(history=True, parser="html.parser")
It works fine for correct web pages, but now I have to parse an incorrectly written page. Here is the HTML fragment:
<form method="post" action="decide.php?act=submit_advance">
<table class="td_advanced">
<tr class="td_advance">
<td colspan="4" class="td_advance"></strong><br></td>
<td colspan="3" class="td_left">Case sensitive:<br><br></td>
<td><input type="checkbox" name="case_sensitive" /><br><br></td>
[...]
</form>
The closing strong tag is invalid, and it prevents the parser from reading all the inputs that follow it:
form = self._browser.get_form()
print(form)
>>> <RoboForm>
Any suggestions?

I have found the solution myself. The comment about BeautifulSoup was helpful and pointed my search in the right direction.
The solution is to use another HTML parser. I tried lxml and it works for me.
self._browser = RoboBrowser(history=True, parser="lxml")
As PyPI doesn't currently have an lxml installer that works with my Python version, I downloaded one from here: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
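To see the difference, here is a minimal sketch (not from the original post) that counts how many input elements each parser recovers from the broken fragment; it assumes bs4 and lxml are installed, and the exact behaviour of html.parser can vary with the Python version, so the counts may differ:
from bs4 import BeautifulSoup

# the broken fragment from the question, with the stray </strong>
broken_html = """
<form method="post" action="decide.php?act=submit_advance">
<table class="td_advanced">
<tr class="td_advance">
<td colspan="4" class="td_advance"></strong><br></td>
<td colspan="3" class="td_left">Case sensitive:<br><br></td>
<td><input type="checkbox" name="case_sensitive" /><br><br></td>
</table>
</form>
"""

for parser in ("html.parser", "lxml"):
    soup = BeautifulSoup(broken_html, parser)
    print(parser, "->", len(soup.find_all("input")), "input(s) found")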

Related

Grabbing the value of an HTML form input field?

There's a web page similar to: www.example.com/form.php
I want to use Python to grab one of the values from the HTML form on the page. For example, if the form had an input named "wanted" with the value "test", I could get the value "test" returned.
I have googled this extensively, but most results relate to posting form data or advise using Django or cgi-bin. I don't have direct access to the server, so I can't do that.
I thought the requests library could do it, but I can't see it in the documentation.
HTML:
<html>
<body>
<form method="" action="formpost.php" name="form1" id="form1">
<input type="text" name"field1" value="this is field1">
<input type="hidden" name="key" value="secret key field">
</form>
</body>
As an example, I'd like something like this in Python:
import special_library
html = special_library.get("http://www.example.com/form.php")
print(html.get_field("wanted"))
Has anyone got any suggestions to achieve this? Or any libraries I may not have thought of or been aware of?
You can use the requests library together with lxml.
Try this:
import requests
from lxml import html
s = requests.Session()
resp = s.get("http://www.example.com/form.php")
doc = html.fromstring(resp.text)
wanted_value = doc.xpath("//input[@class='wanted_class_name']/@value")
print(wanted_value)
You can check the following resources:
requests
xpath
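Applied to the form posted in the question, here is a hedged sketch along the same lines that selects the hidden key input by its name attribute instead of a class (the URL is the asker's placeholder):
import requests
from lxml import html

resp = requests.get("http://www.example.com/form.php")
doc = html.fromstring(resp.text)

# select the hidden input by its name attribute
key_value = doc.xpath("//input[@name='key']/@value")
print(key_value)  # e.g. ['secret key field'] if the form is present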

Python XPath keeps returning empty list

XPath via lxml in Python has been making me run in circles. I can't get it to extract text from an HTML table despite having what I believe to be the correct XPath. I'm using Chrome to inspect and extract the XPath, then using it in my code.
Here is the HTML table taken directly from the page:
<div id="vehicle-detail-model-specs-container">
<table id="vehicle-detail-model-specs" class="table table-striped vdp-feature-table">
<!-- Price -->
<tr>
<td><strong>Price:</strong></td>
<td>
<strong id="vehicle-detail-price" itemprop="price">$ 2,210.00</strong> </td>
</tr>
<!-- VIN -->
<tr><td><strong>VIN</strong></td><td> *0343</td></tr>
<!-- MILEAGE -->
<tr><td><strong>Mileage</strong></td><td>0 mi</td></tr>
</table>
I'm trying to extract the Mileage. The XPath I'm using is:
//*[@id="vehicle-detail-model-specs"]/tbody/tr[3]/td[2]
And the Python code that I'm using is:
import requests
from lxml import html

page = requests.get(URL)
tree = html.fromstring(page.content)
mileage = tree.xpath('//*[@id="vehicle-detail-model-specs"]/tbody/tr[3]/td[2]')
print(mileage)
Note: I've tried adding /text() to the end and I still get nothing back, just an empty list [].
What am I doing wrong and why am I not able to extract the table value from the above examples?
As Amber has pointed out, you should omit the tbody part.
You wrote tbody in your XPath, but there is no <tbody> tag in the HTML source of your table: Chrome inserts one when it renders the page, so an XPath copied from the inspector contains a step that lxml never sees.
Using the HTML you posted, I am able to extract the mileage value with the following XPath:
tree.xpath('//*[@id="vehicle-detail-model-specs"]/tr[3]/td[2]')[0].text_content()
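For reference, a self-contained sketch using only the fragment posted in the question; without the tbody step the lookup succeeds:
from lxml import html

# the table fragment posted in the question
snippet = """
<div id="vehicle-detail-model-specs-container">
<table id="vehicle-detail-model-specs" class="table table-striped vdp-feature-table">
<tr><td><strong>Price:</strong></td>
<td><strong id="vehicle-detail-price" itemprop="price">$ 2,210.00</strong></td></tr>
<tr><td><strong>VIN</strong></td><td> *0343</td></tr>
<tr><td><strong>Mileage</strong></td><td>0 mi</td></tr>
</table>
</div>
"""

tree = html.fromstring(snippet)
# no /tbody step: lxml sees the markup as written, not as Chrome renders it
mileage = tree.xpath('//*[@id="vehicle-detail-model-specs"]/tr[3]/td[2]')[0].text_content()
print(mileage)  # -> 0 mi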

BeautifulSoup Parses Table Incorrectly

Having trouble getting Beautiful Soup to process a large table of play-by-play basketball data properly. Code:
import urllib.request
from bs4 import BeautifulSoup
request = urllib.request.Request('http://www.basketball-reference.com/boxscores/pbp/201611220LAL.html')
result = urllib.request.urlopen(request)
resulttext = result.read()
soup = BeautifulSoup(resulttext, "html.parser")
pbpTable = soup.find('table', id="pbp")
If you run this example yourself, you will find that the table is not fully parsed; all we get out is this:
<table class="suppress_all sortable stats_table" data-cols-to-freeze="1" id="pbp">
<caption>Play-By-Play Table</caption>
<tr class="thead" id="q1">
<th colspan="6">1st Q</th></tr></table>
The problem is in the parsing itself: printing the soup variable gives (among other things)
</div>
<div class="table_wrapper" id="all_pbp">
<div class="section_heading">
<span class="section_anchor" data-label="Play-By-Play" id="pbp_link"></span>
<h2>Play-By-Play</h2> <div class="section_heading_text">
<ul> <li>  Jump to: 1st | 2nd | 3rd | 4th <br> <span class="bbr-play-score key">scoring play</span> <span class="bbr-play-tie key">tie</span> <span class="bbr-play-leadchange key">lead change</span></br></li>
</ul>
</div>
</div> <div class="table_outer_container">
<div class="overthrow table_container" id="div_pbp">
<table class="suppress_all sortable stats_table" data-cols-to-freeze="1" id="pbp"><caption>Play-By-Play Table</caption><tr class="thead" id="q1">
<th colspan="6">1st Q</th></tr></table></div></div></div></div></div></body></html>
Most importantly, a closing </table> tag appears out of nowhere. Viewing the page source at the relevant link, we can see that the table is not closed there; it goes on for a while. Is there any fix for this besides implementing my own HTML parsing code?
Use "lxml" or "html5lib" instead of "html.parser" in
soup = BeautifulSoup(resulttext, "lxml")
and you get more data.
But you may have to install lxml or html5lib if you don't have them yet.
pip install lxml
pip install html5lib
lxml may need a C/C++ compiler, the libxml library (libxml.dll on Windows), etc.
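For example, here is a hedged sketch of the same scrape with html5lib (the page's markup and URL may have changed since the question was asked):
import urllib.request
from bs4 import BeautifulSoup  # needs: pip install html5lib

request = urllib.request.Request('http://www.basketball-reference.com/boxscores/pbp/201611220LAL.html')
resulttext = urllib.request.urlopen(request).read()

# html5lib parses the page the way a browser would, so the table
# should no longer be truncated at the stray tag
soup = BeautifulSoup(resulttext, "html5lib")
pbpTable = soup.find('table', id="pbp")
if pbpTable is not None:
    print(len(pbpTable.find_all('tr')), "rows parsed")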

Unable to execute a Python web scraping script after a user submits a form on a Flask website, from the second time onwards

Using Flask and Python, I have a website running on localhost that allows the user to select a specific month to download a report for. Based on the selected month, I then have my web scraping file imported, which retrieves the data from another website (requires login). My web scraping script uses mechanize.
Here is the portion of code where my web scraping file (webscrape.py) is imported after the download button is clicked (the selection is done on office.html):
@app.route('/office/', methods=['GET','POST'])
def office():
    form = reportDownload()
    if request.method == 'POST':
        import webscrape
        return render_template('office.html', success=True)
    elif request.method == 'GET':
        return render_template('office.html', form=form)
In the render_template call, success=True is passed as an argument so that my office.html template displays a success message; otherwise (on a GET request) it displays the form for user selection. Here is my script for office.html:
{% extends "layout.html" %}
{% block content %}
<h2>Office</h2>
{% if success %}
<p>Report was downloaded successfully!</p>
{% else %}
<form action="{{ url_for('office') }}" method="POST">
<table width="70%" align="center" cellpadding="20">
<tr>
<td align="right"><p>Download report for: </p></td>
<td align="center"><p>Location</p>
{{form.location}}</td>
<td align="center"><p>Month</p>
{{form.month}} </td>
<td align="center"><p>Year</p>
{{form.year}} </td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td align="center">{{form.submit}} </td>
</tr>
</table>
</form>
{% endif %}
{% endblock %}
The problem I have is with further downloads: after downloading for the first time, I go back to the office page and download a report again. On the second try, the success message gets displayed but nothing gets downloaded.
In my web scraping script, using mechanize and cookiejar, I have these few lines of code at the beginning:
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
and I proceed with the web scraping.
When running the web scraping file from my terminal (or command prompt), the script executes without any problems, even on a second or third run. So I think it may be a problem with the website code.
Any suggestions will be appreciated! I have tried different ways of resolving the problem, such as using return redirect instead, or trying to clear the cookies in the cookie jar. None has worked so far, or I may be using the methods wrongly.
Thank you in advance!
Once your Flask app is started, it only imports each module once. That means that when it runs into import webscrape for the second time it says “well, I already imported that earlier, so no need to take further action…” and moves on to the next line, rendering the template without actually running the script.
In that sense, import in Python is not the same as require in other languages such as PHP (it behaves more like PHP's require_once).
The solution would be to make your scraper an object (a class) and instantiate it each time you need it. Then you move the import to the top of the file, and inside the if request.method == 'POST' branch you just create a new instance of your web scraper, as in the sketch below.
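Here is a minimal sketch of that refactor; ReportScraper and its download method are hypothetical names, and the form field access assumes WTForms-style fields:
# webscrape.py -- wrap the scraping logic in a class instead of
# running it at import time; ReportScraper is a hypothetical name
import mechanize
import cookielib

class ReportScraper(object):
    def __init__(self):
        # a fresh browser and cookie jar for every instance
        self.br = mechanize.Browser()
        self.cj = cookielib.LWPCookieJar()
        self.br.set_cookiejar(self.cj)

    def download(self, location, month, year):
        # ... the existing login and report-download steps go here ...
        pass

# in the Flask app -- import once, at the top of the file
from webscrape import ReportScraper

@app.route('/office/', methods=['GET', 'POST'])
def office():
    form = reportDownload()
    if request.method == 'POST':
        # a new instance on every POST, so the scrape runs every time
        ReportScraper().download(form.location.data, form.month.data, form.year.data)
        return render_template('office.html', success=True)
    return render_template('office.html', form=form)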

A href catching

I'm using BeautifulSoup for parsing some HTML. Here is the content:
<tr>
<th>Your provider:</th>
<td>
<img src="/isp_logos/la-la-la.ico" alt=""/>
<a href="/isp/SomeProvider">
Provider name </a>
<a href="http://*/isp-comparer/?isp=000000">
</a>
</td>
</tr>
I have to get the SomeProvider text from the first link. My code is:
import re
from BeautifulSoup import BeautifulSoup

contentSoup = BeautifulSoup(ThatHtml)
print(contentSoup.findAll('a', href=re.compile('/isp/(.*)')))
The result is an empty list. Why? Is there another way to do this?
With your posted code and input, I'm getting this list back:
[<a href="/isp/SomeProvider"> Provider name </a>]
Are you using the newest 3.1.x version of BeautifulSoup? I actually had the same problem, but it turned out I had downloaded the 2.x version of BeautifulSoup, thinking that the 2.x meant it was compatible with Python 2.x.
Assuming that the first <a> tag contains the SomeProvider link, you could just use:
contentSoup.a
to extract that tag.
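For completeness, a sketch with the modern bs4 package that pulls SomeProvider straight out of the href (the fragment is the one posted above):
import re
from bs4 import BeautifulSoup

ThatHtml = """
<tr><th>Your provider:</th>
<td>
<a href="/isp/SomeProvider"> Provider name </a>
<a href="http://*/isp-comparer/?isp=000000"></a>
</td></tr>
"""

soup = BeautifulSoup(ThatHtml, "html.parser")
# match only hrefs that start with /isp/
link = soup.find('a', href=re.compile(r'^/isp/'))
print(link['href'].split('/isp/', 1)[1])  # -> SomeProvider
print(link.get_text(strip=True))          # -> Provider name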
