Parsing an AJAX response (HTML/XML) with lxml changes < > characters to &lt; &gt; - Python

I'm trying to parse a webpage in Python: an AJAX response that basically looks like this
xml:
<table class="tab02">
<tr>
<th>Skrót</th>
<th>Pełna nazwa</th>
</tr>
<tr>
<td>1AT</td>
<td>ATAL SPÓŁKA AKCYJNA</td>
</tr>
</table>
Link: http://www.gpw.pl/ajaxindex.php?action=GPWCompanySearch&start=listForLetter&letter=A&listTemplateName=GPWCompanySearch%2FajaxList_PL
If I put this markup in a variable in a Python file and parse it with the lxml library (see below), everything parses successfully and the whole result is well formatted:
from lxml import etree
root = etree.fromstring(xml)
print(etree.tounicode(root))  # or: print(etree.tostring(root))
The problem appears when parsing the data from the webpage (see the example code below):
magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
root = etree.parse(link2page, magical_parser)
print(etree.tounicode(root))
As a result, all the < and > characters from the table are changed to &lt; and &gt;:
<response>
<html>
&lt;table class="tab02"&gt;
&lt;tr&gt;
&lt;th&gt;Skrót&lt;/th&gt;
&lt;th&gt;Pełna nazwa&lt;/th&gt;
&lt;/tr&gt;
etc.
I've also tried fetching the link with urllib first and parsing it as HTML, but I fail every time. Can anyone give me a hint, please?
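One possible reading of the output above is that the table markup is stored as escaped text inside the <html> element of the XML response, so it has to be parsed a second time. The sketch below uses the standard library's xml.etree.ElementTree so it runs without extra packages; lxml.etree offers the same fromstring, findtext and iter calls. The response_xml string is a hypothetical sample shaped like the output shown in the question, not the real server response.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the table is escaped text inside <html>,
# as the output in the question suggests.
response_xml = (
    "<response><html>"
    "&lt;table class=\"tab02\"&gt;"
    "&lt;tr&gt;&lt;th&gt;Skrót&lt;/th&gt;&lt;th&gt;Pełna nazwa&lt;/th&gt;&lt;/tr&gt;"
    "&lt;tr&gt;&lt;td&gt;1AT&lt;/td&gt;&lt;td&gt;ATAL SPÓŁKA AKCYJNA&lt;/td&gt;&lt;/tr&gt;"
    "&lt;/table&gt;"
    "</html></response>"
)

# First pass: parse the XML envelope. Reading the element text
# gives back the un-escaped markup as a plain string.
outer = ET.fromstring(response_xml)
inner_markup = outer.findtext("html")

# Second pass: parse that string as its own document.
table = ET.fromstring(inner_markup)

# Collect the cell texts of every row (th and td alike).
rows = [[cell.text for cell in tr] for tr in table.iter("tr")]
print(rows)
```

If the inner markup is not well-formed XML, the second pass would need an HTML parser (for example lxml.html.fromstring) instead of an XML one.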

Related

Python - XPath issue while scraping the IMDb Website

I am trying to scrape the movies on IMDb using Python, and I can get data about all the important aspects except the actors' names.
Here is a sample URL that I am working on:
https://www.imdb.com/title/tt0106464/
Using the browser's "Inspect" functionality I found the XPath that matches all the actors' names, but when I run the code in Python, it looks like the XPath is not valid (it does not return anything).
Here is a simple version of the code I am using:
import requests
from lxml import html
movie_to_scrape = "https://www.imdb.com/title/tt0106464"
timeout_time = 5
IMDb_html = requests.get(movie_to_scrape, timeout=timeout_time)
doc = html.fromstring(IMDb_html.text)
actors = doc.xpath('//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()')
print(actors)
I tried changing the XPath many times, making it more generic and then more specific, but it still does not return anything.
Don't blindly accept the markup structure you see using inspect element.
Browsers are very lenient and will try to fix any markup issue in the source.
That being said, if you check the source using view source, you can see that the table you're trying to scrape has no <tbody>; those elements are inserted by the browser.
So if you remove it from the query here
//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text() -> //table[@class="cast_list"]//tr//td[not(contains(@class,"primary_photo"))]//a/text()
your query should work.
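A minimal sketch of the effect described above, using the standard library's xml.etree.ElementTree and its limited XPath subset so that it runs without lxml. The source string is a hypothetical, simplified stand-in for the served markup (it contains no <tbody>), and the not(contains(...)) test from the original query is replaced by a plain Python check because ElementTree does not support XPath functions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page source: no <tbody>.
source = """
<div>
  <table class="cast_list">
    <tr><td colspan="4" class="castlist_label">Cast overview</td></tr>
    <tr class="odd">
      <td class="primary_photo"><a href="/name/nm0000418/"><img alt="Danny Glover" /></a></td>
      <td><a href="/name/nm0000418/">Danny Glover</a></td>
    </tr>
  </table>
</div>
"""
page = ET.fromstring(source)

# A path that demands <tbody> matches nothing in the raw source ...
with_tbody = page.findall(".//table[@class='cast_list']//tbody//tr")
# ... while the same path without it finds the rows.
without_tbody = page.findall(".//table[@class='cast_list']//tr")

# Link text of every cell that is not the photo cell.
names = [
    a.text.strip()
    for tr in without_tbody
    for td in tr.findall("td")
    if td.get("class") != "primary_photo"
    for a in td.findall(".//a")
]
print(len(with_tbody), len(without_tbody), names)
```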
Looking at the HTML, start with a simple XPath like //td[@class="primary_photo"]
<table class="cast_list">
<tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000418/?ref_=tt_cl_i1"
><img height="44" width="32" alt="Danny Glover" title="Danny Glover" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" class="loadlate hidden " loadlate="https://m.media-amazon.com/images/M/MV5BMTI4ODM2MzQwN15BMl5BanBnXkFtZTcwMjY2OTI5MQ##._V1_UY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td>
PYTHON:
for photo in doc.xpath('//td[@class="primary_photo"]'):
    print(photo)

Python and BeautifulSoup4: finding certain text from the tables and parsing the very next table

I'm facing quite a tricky problem while trying to fetch some data with BeautifulSoup.
I'd like to find all the tables that have certain text in them (in my example code 'Name:', 'City:' and 'Address:') and parse the text that is located in the very next table in the source code.
Page source code:
...
...
<td>Name:</td>
<td>John</td>
...
<td>City:</td>
<td>London</td>
...
<td>Address:</td>
<td>Bowling Alley 123</td>
...
...
I'd like to parse: "John", "London", "Bowling Alley 123"
Sorry I don't have any python code here to show my past effort, but it's because I've no idea where to start. Thanks!
This is clunky, but depending on how your TD's are wrapped and how consistent your TD targets are, you should be able to find them, iterate through them and use findNextSibling() to get your data:
from BeautifulSoup import BeautifulSoup
html = """\
<table>
<tr>
<td>Name:</td>
<td>John</td>
</tr>
<tr>
<td>City:</td>
<td>London</td>
</tr>
<tr>
<td>Address:</td>
<td>Bowling Alley 123</td>
</tr>
</table>
"""
targets=["City:","Address:","Name:"]
soup = BeautifulSoup(html)
for tr in soup.findAll("tr"):
    for td in tr.findAll("td"):
        if td.text in targets:
            print(td.findNextSibling().text)
Bottom line, as long as you've got some sane/normal elements containing your TD's, using the NextSibling functions should get you where you're going.
Whether this works properly depends on whether the HTML is well formed, but it will likely work even if there are extraneous newlines or other text.
import bs4

def parseCAN(html):
    b = bs4.BeautifulSoup(html)
    matches = ('City:', 'Address:', 'Name:')
    found = []
    elements = b.findAll('td')
    for n, e in enumerate(elements):
        # Skip cells that are not one of the labels we are looking for.
        if e.text not in matches:
            continue
        # The value is the cell that follows the label, if there is one.
        if n < len(elements) - 1:
            found.append(elements[n+1].text)
    return found
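The same "label cell, then value cell" idea can also be sketched with only the standard library, in case BeautifulSoup is not installed. This is an illustrative variant, not the approach from the answers above: CellCollector and values_after are hypothetical names, and the sketch assumes each label is immediately followed by its value cell in document order.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


def values_after(html, labels):
    """Map each label cell to the text of the cell that follows it."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    cells = [c.strip() for c in collector.cells]
    return {a: b for a, b in zip(cells, cells[1:]) if a in labels}


sample = (
    "<table>"
    "<tr><td>Name:</td><td>John</td></tr>"
    "<tr><td>City:</td><td>London</td></tr>"
    "<tr><td>Address:</td><td>Bowling Alley 123</td></tr>"
    "</table>"
)
result = values_after(sample, {"Name:", "City:", "Address:"})
print(result)
```

Returning a dictionary keeps each value attached to its label, which is convenient when the labels appear in a different order on different pages.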

Can't get a regex pattern to work in Python

I have the following (repeating) HTML text from which I need to extract some values using Python and regular expressions.
<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>
I can get the first value by using
match_det = re.compile(r'<td width="35.+?">(.+?)</td>').findall(html_source_det)
But the above is all on one line. I also need the second value, which is on the line following the first one, and I cannot get that to work. I have tried the following, but I don't get a match:
match_det = re.compile('<td width="35.+?">(.+?)</td>\n'
'<td width="65.+?value="(.+?)"></td>').findall(html_source_det)
Perhaps it fails because the text is multiline, but I added "\n" at the end of the first line, which I thought would resolve it. It did not.
What am I doing wrong?
The html_source is downloaded rather than read from a static HTML file like the one outlined above (I only pasted it here so you could see the text). Maybe this is not the best way of getting the source.
I am obtaining the html_source like this:
new_url = "https://webaccess.site.int/curracc/" + url_details #not a real url
myresponse_det = urllib2.urlopen(new_url)
html_source_det = myresponse_det.read()
Please do not try to parse HTML with regex, as it is not regular. Instead use an HTML parsing library like BeautifulSoup. It will make your life a lot easier! Here is an example with BeautifulSoup:
from bs4 import BeautifulSoup
html = '''<tr>
<td width="35%">Demand No</td>
<td width="65%"><input type="text" name="T1" size="12" onFocus="this.blur()" value="876716001"></td>
</tr>'''
soup = BeautifulSoup(html)
print(soup.find('td', attrs={'width': '65%'}).findNext('input')['value'])
Or more simply:
print(soup.find('input', attrs={'name': 'T1'})['value'])
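As for why the two-line pattern in the question finds nothing, one plausible cause is that the downloaded text does not have exactly one "\n" between the two cells: a "\r\n" line ending or leading indentation is enough to break the match. The sketch below assumes such a source (the html_source_det value here is a hypothetical sample, not the real download) and replaces the literal "\n" with \s*, which tolerates any whitespace.

```python
import re

# Hypothetical sample: Windows line endings plus indentation,
# which a downloaded page may well contain.
html_source_det = (
    '<tr>\r\n'
    '    <td width="35%">Demand No</td>\r\n'
    '    <td width="65%"><input type="text" name="T1" size="12" '
    'onFocus="this.blur()" value="876716001"></td>\r\n'
    '</tr>'
)

# The pattern from the question: requires "</td>" + "\n" + "<td".
strict = re.compile(
    '<td width="35.+?">(.+?)</td>\n'
    '<td width="65.+?value="(.+?)"></td>'
)

# Tolerant variant: \s* absorbs "\r\n" and any indentation.
tolerant = re.compile(
    r'<td width="35[^"]*">(.+?)</td>\s*'
    r'<td width="65[^"]*">.*?value="(.+?)"'
)

print(strict.findall(html_source_det))
pairs = tolerant.findall(html_source_det)
print(pairs)
```

This only explains the failed match; the parser-based approach above remains the more robust choice once the markup varies.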

How can I get the first and third td from a table with BeautifulSoup?

I am currently using Python and BeautifulSoup to scrape some website data.
I'm trying to pull cells from a table which is formatted like so:
<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>
The problem with the above HTML is that BeautifulSoup reads it as one tag. I need to pull the values from the first <td> and the third <td>, which would be 1 and 20, respectively.
Unfortunately, I have no idea how to go about this. How can I get BeautifulSoup to read the 1st and 3rd <td> tags of each row of the table?
Update:
I figured out the problem. I was using html.parser instead of the default for BeautifulSoup. Once I switched to the default the problems went away. Also I used the method listed in the answer.
I also found out that the different parsers are very temperamental with broken code. For instance, the default parser refused to read past row 192, but html5lib got the job done. So try lxml, html.parser, and html5lib if you are having problems parsing the entire table.
That's a nasty piece of HTML you've got there. If we ignore the semantics of table rows and table cells for a moment and treat it as pure XML, its structure looks like this:
<tr>
  <td>1
    <td>
      <td>20
        <td>5%</td>
      </td>
    </td>
  </td>
</tr>
BeautifulSoup, however, knows about the semantics of HTML tables, and instead parses it like this:
<tr>
<td>1 <!-- an IMPLICITLY (no closing tag) closed td element -->
<td> <!-- as above -->
<td>20 <!-- as above -->
<td>5%</td> <!-- an EXPLICITLY closed td element -->
</td> <!-- an error; ignore this -->
</td> <!-- as above -->
</td> <!-- as above -->
</tr>
... so that, as you say, 1 and 20 are in the first and third td elements (not tags) respectively.
You can actually get at the contents of these td elements like this:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<tr><td>1<td><td>20<td>5%</td></td></td></td></tr>")
>>> tr = soup.find("tr")
>>> tr
<tr><td>1</td><td></td><td>20</td><td>5%</td></tr>
>>> td_list = tr.find_all("td")
>>> td_list
[<td>1</td>, <td></td>, <td>20</td>, <td>5%</td>]
>>> td_list[0] # Python starts counting list items from 0, not 1
<td>1</td>
>>> td_list[0].text
'1'
>>> td_list[2].text
'20'
>>> td_list[3].text
'5%'

fetching information with scrapy(Python)

When I try to capture the following information:
<td>But<200g/M2</td>
name = fila.select('.//td[2]/text()').extract()
I capture the following
"But"
Apparently there is a conflict with the characters "<" and "/".
Escape special characters with a '\', so:
But\<200g\/M2
Note that creating a file with those characters wouldn't be so easy.
Here is an approach that uses BeautifulSoup, in case you have more luck with a different library:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<html><head><title>StackOverflow-Question</title></head><body>
<table>
<tr>
<td>Ifs</td>
<td>Ands</td>
<td>But<200g/M2</td>
</tr>
</table>
</body></html>""")
print(soup.find_all('td')[2].get_text())
The output of this is:
But<200g/M2
If you want to use XPath, you could also use the ElementTree XML API. Here I'm using BeautifulSoup to take the HTML and convert it to valid XML so I can run an XPath query against it:
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
html = """<html><head><title>StackOverflow-Question</title></head><body>
<table>
<tr>
<td>Ifs / Ands / Or</td>
<td>But<200g/M2</td>
</tr>
</table>
</body></html>"""
soup = BeautifulSoup(html)
root = ET.fromstring(soup.prettify())
print(root.findall('.//td[2]')[0].text)
The output of this is the same. Note that the HTML is slightly different: XPath positions start at 1, while Python list indices start at 0.
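The XML route works because a bare "<" inside text is not valid XML, so it has to be written as &lt; before an XML parser will accept it; BeautifulSoup does this when it serializes the document. A minimal standard-library sketch of just that step, with html.escape standing in for the serializer (an assumption made for illustration):

```python
import xml.etree.ElementTree as ET
from html import escape

raw_text = "But<200g/M2"

# The unescaped cell is not well-formed XML, so parsing it fails.
try:
    ET.fromstring("<td>" + raw_text + "</td>")
    raw_parses = True
except ET.ParseError:
    raw_parses = False

# Escaping the text first makes it valid; the parser then
# hands back the original characters.
cell_xml = "<td>{}</td>".format(escape(raw_text))
cell = ET.fromstring(cell_xml)

print(raw_parses)
print(cell_xml)
print(cell.text)
```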
