Python - XPath issue while scraping the IMDb Website - python

I am trying to scrape movie pages on IMDb using Python. I can get data about all the important aspects except the actors' names.
Here is a sample URL that I am working on:
https://www.imdb.com/title/tt0106464/
Using the browser's "Inspect" functionality I found the XPath that matches all the actors' names, but when I run the code in Python the XPath does not seem to be valid (it does not return anything).
Here is a simple version of the code I am using:
import requests
from lxml import html
movie_to_scrape = "https://www.imdb.com/title/tt0106464"
timeout_time = 5
IMDb_html = requests.get(movie_to_scrape, timeout=timeout_time)
doc = html.fromstring(IMDb_html.text)
actors = doc.xpath('//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text()')
print(actors)
I tried changing the XPath many times, making it more generic and then more specific, but it still does not return anything.

Don't blindly trust the markup structure you see in Inspect Element.
Browsers are very lenient and will try to fix any markup issues in the source.
That said, if you check the raw source using View Source, you can see that the table you're trying to scrape has no <tbody>; the browser inserts it when it builds the DOM.
So if you remove it from the query:
//table[@class="cast_list"]//tbody//tr//td[not(contains(@class,"primary_photo"))]//a/text() -> //table[@class="cast_list"]//tr//td[not(contains(@class,"primary_photo"))]//a/text()
your query should work.
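The effect of the missing <tbody> can be reproduced offline. The sketch below is only an illustration: it uses the standard library's ElementTree as a stand-in for lxml, the markup is a hypothetical, simplified fragment modeled on the raw page source, and `cast_names` and the second actor's name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment shaped like the raw source: no <tbody>.
RAW_SOURCE = """
<table class="cast_list">
  <tr>
    <td class="primary_photo"><a href="/name/nm0000418/">photo</a></td>
    <td><a href="/name/nm0000418/">Danny Glover</a></td>
  </tr>
  <tr>
    <td class="primary_photo"><a href="/name/nm0000001/">photo</a></td>
    <td><a href="/name/nm0000001/">Second Actor</a></td>
  </tr>
</table>
"""


def cast_names(markup, with_tbody):
    """Collect link text from every cell that is not a primary_photo cell."""
    table = ET.fromstring(markup)
    # A path through <tbody> only matches if that element exists in the source.
    row_path = "./tbody/tr" if with_tbody else "./tr"
    names = []
    for row in table.findall(row_path):
        for cell in row.findall("td"):
            if cell.get("class") == "primary_photo":
                continue
            names.extend(a.text.strip() for a in cell.findall(".//a") if a.text)
    return names


print(cast_names(RAW_SOURCE, with_tbody=True))   # path requires <tbody>: no rows match
print(cast_names(RAW_SOURCE, with_tbody=False))  # path without <tbody>: rows match
```

The same reasoning applies to the lxml query: any step that names an element the raw source does not contain makes the whole expression match nothing.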

Looking at the HTML, start with a simple XPath like //td[@class="primary_photo"]:
<table class="cast_list">
<tr><td colspan="4" class="castlist_label">Cast overview, first billed only:</td></tr>
<tr class="odd">
<td class="primary_photo">
<a href="/name/nm0000418/?ref_=tt_cl_i1"
><img height="44" width="32" alt="Danny Glover" title="Danny Glover" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" class="loadlate hidden " loadlate="https://m.media-amazon.com/images/M/MV5BMTI4ODM2MzQwN15BMl5BanBnXkFtZTcwMjY2OTI5MQ@@._V1_UY44_CR1,0,32,44_AL_.jpg" /></a> </td>
<td>
PYTHON:
# Iterate over every cell that holds an actor's photo.
for photo in doc.xpath('//td[@class="primary_photo"]'):
    print(photo)

Related

BeautifulSoup4 not able to scrape data from this table

Sorry for this silly question as I'm new to web scraping and have no knowledge about HTML etc.
I'm trying to scrape data from this website. Specifically, from this part/table of the page:
末"四"位数 9775,2275,4775,7275
末"五"位数 03881,23881,43881,63881,83881,16913,66913
末"六"位数 313110,563110,813110,063110
末"七"位数 4210962,9210962,9785582
末"八"位数 63262036
末"九"位数 080876872
I'm sorry that it's in Chinese and that it looks terrible, since I can't embed the picture. The table is roughly in the middle of the page (about 40% of the way down from the top). The table id is 'tr_zqh'.
Here is my source code:
import bs4 as bs
import urllib.request
def scrapezqh(url):
    source = urllib.request.urlopen(url).read()
    page = bs.BeautifulSoup(source, 'html.parser')
    print(page)

url = 'http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1'
print(scrapezqh(url))
It scrapes most of the table except the part that I'm interested in. Here is part of what it returns, where I think the data should be:
<td class="tdcolor">网下有效申购股数(万股)
</td>
<td class="tdwidth" id="td_wxyxsggs"> 
</td>
</tr>
<tr id="tr_zqh">
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>
<td class="tdcolor">中签号公布日期
</td>
<td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
I'd like to get the content of this row: tr id="tr_zqh" (the 6th line above). However, for some reason its data is not scraped (there is no content below it). When I check the source code of the webpage, the data is in the table. I don't think it is a dynamic table that BeautifulSoup4 can't handle. I've tried both the lxml and html parsers, and I've tried pandas.read_html; they all returned the same results. I'd like some help understanding why it doesn't get the data and how I can fix it. Many thanks!
I forgot to mention that I tried page.find('tr'). It returned part of the table but not the lines I'm interested in: page.find('tr') returns the 1st line of the screenshot, and I want the data in the 2nd and 3rd lines (highlighted in the screenshot).
If you extract a couple of variables from the initial page, you can use them to make a request to the API directly. You then get a JSON object that contains the data.
import requests
import re
import json
from pprint import pprint
s = requests.session()
r = s.get('http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1')
# Pull the security code and the API token out of the page's inline JavaScript.
gdpm = re.search(r"var gpdm = '(.*)'", r.text).group(1)
token = re.search(r'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get\?type=XGSG_ZQH&token=(.*)&st=', r.text).group(1)
url = "http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?type=XGSG_ZQH&token=" + token + "&st=LASTFIGURETYPE&sr=1&filter=%28securitycode='" + gdpm + "'%29&js=var%20zqh=%28x%29"
r = s.get(url)
# The response looks like "var zqh=[...]"; drop the 8-character prefix before parsing.
j = json.loads(r.text[8:])
for row in j:
    print(row['LOTNUM'])
#pprint(j)
Outputs:
9775,2275,4775,7275
03881,23881,43881,63881,83881,16913,66913
313110,563110,813110,063110
4210962,9210962,9785582
63262036
080876872
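The slice r.text[8:] in the code above depends on the prefix being exactly eight characters long. A slightly more tolerant way to strip it is sketched here, assuming the body has the form "var NAME=JSON". The helper name `parse_js_assignment` is hypothetical, and the sample body is hand-written from the output listed above rather than captured from the API.

```python
import json


def parse_js_assignment(body):
    """Parse the JSON on the right-hand side of a 'var name=<json>' response."""
    # Split on the first "=" only, so "=" characters inside the JSON are kept.
    _, _, payload = body.partition("=")
    return json.loads(payload)


# Hand-written sample in the assumed shape: a list of objects with a LOTNUM key.
sample_body = 'var zqh=[{"LOTNUM": "9775,2275,4775,7275"}, {"LOTNUM": "63262036"}]'

rows = parse_js_assignment(sample_body)
lot_numbers = [row["LOTNUM"] for row in rows]
for number in lot_numbers:
    print(number)
```

If the real response carried a trailing semicolon or other wrapping, the payload would need further trimming before json.loads.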
From where I look at things your question isn't clear to me. But here's what I did.
I do a lot of webscraping so I just made a package to get me beautiful soup objects of any webpage. Package is here.
So my answer depends on that. But you can take a look at the sourcecode and see that there's really nothing esoteric about it. You may drag out the soup-making part and use as you wish.
Here we go.
pip install pywebber --upgrade
from pywebber import PageRipper
page = PageRipper(url='http://data.eastmoney.com/xg/xg/detail/300741.html?tr_zqh=1', parser='html5lib')
page_soup = page.soup
tr_zqh_table = page_soup.find('tr', id='tr_zqh')
from here you can do tr_zqh_table.find_all('td')
tr_zqh_table.find_all('td')
Output
[
<td class="tdtitle" id="td_zqhrowspan">中签号
</td>, <td class="tdcolor">中签号公布日期
</td>, <td class="ltxt" colspan="3"> 2018-02-22 (周四)
</td>
]
Going a bit further
for td in tr_zqh_table.find_all('td'):
    print(td.contents)
Output
['中签号\n ']
['中签号公布日期\n ']
['\xa02018-02-22 (周四)\n ']

Opening and Saving a Web Page in Internet Explorer through Python

I have written a python script to repeatedly hit a site using selenium and scrape a table of interest. I am not super well versed in web development. I did look through plenty of articles on urllib, requests, etc but am using selenium because this site uses a single sign on type authentication and it is well above my skill level to try and replicate that login through python (I think...). I can deal with the longer processing time and extra effort of selenium as long as I get the data in the end. Once I automate the clicks with selenium to get there I save the page source and parse it with beautifulsoup.
The page HTML looks something like this (I have obfuscated it a bit as it is a corporate website):
<TABLE><TBODY>
<TR>
<TD><B>abc<B></B></B></TD></TR>
<TR>
<TH title="Source - abc " class=tableHeader>
<DIV class=usageHeaderWithoutRefinements>abc </DIV></TH>
<TH title="Source - abc " class=tableHeader>
<DIV class=usageHeaderWithoutRefinements>abc </DIV></TH>
<TH title="Source - abc " class=tableHeader>
<DIV class=usageHeaderWithoutRefinements>abc </DIV></TH></TR>
<TR>
<TD title="Source - abc " class=tableData><B>1234</B> </TD>
<TD title="Source - abc " class=tableData><B>1234</B> </TD>
</TBODY></TABLE>
In the browser it is rendered as a table where the numbers like 1234 are hyperlinks to another page. I just want to store the numbers in the tables. Weirdly, it seems like sometimes when I run code like:
part_info_1 = soup.findAll('table')[0]
data_rows = part_info_1.findAll('tr')
data_1 = [[td.getText() for td in row.findAll('td')] for row in data_rows]
Sometimes it returns the table information but sometimes it returns an empty list. It seems like the Javascript is the problem.
My question: Is there a way to add to the python code above to pull out the info in the tags without having to actually run the javascript or do anything similarly complicated? I'm hoping it is easy since the number is right there in the HTML but I mostly copied that code from another request for help and I'm not super comfortable extending so just asking. Alternatively, is there something simple I could do through python/selenium to make it so that when I use:
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
I can be sure that I always get the right table and not an empty list?
Thanks in advance.

Python Requests gives different page text than Internet Explorer

Looking at my stackoverflow user profile page: https://stackoverflow.com/users/2683104/roberto
The site indicates I have been a member for 316 days (screenshots at end of post). If I view source in my browser (IE11), I can see this data comes from a days-visited class.
But if I look for this same days-visited information using Python Requests, the data does not appear anywhere. Why?
from requests import Session
from BeautifulSoup import BeautifulSoup
s = Session()
url = 'https://stackoverflow.com/users/2683104/roberto'
page = s.get(url)
soup = BeautifulSoup(page.text)
print soup.prettify() #server response, prettified
# following returns error
# AttributeError: 'NoneType' object has no attribute 'getText'
#days_visited = soup.find('span', attrs={'id':'days-visited'}).getText()
s.close()
That field is not visible to your script (or to other users). If you want to scrape that piece of information, your script will need to log in and store the appropriate cookies.
This is what is seen by users that aren't you:
And the code block they see:
<tbody>
<tr>
<th>visits</th>
<td>member for</td>
<td class="cool" title="2013-08-14 15:38:01Z">11 months</td>
</tr>
<tr>
<th></th>
<td>seen</td>
<td class="supernova" title="2014-08-08 05:26:50Z">
<span title="2014-08-08 05:26:50Z" class="relativetime">6 mins ago</span>
</td>
</tr>
</tbody>
Normally, I'd recommend against scraping Stack Overflow for data and use the API instead, but this particular piece of information isn't returned as part of the User object.
As the comments said, 'days-visited' is only shown when you are logged in, and it can be seen only by the member themselves.
You can find your cookies in your browser and send them with your request.
http://docs.python-requests.org/en/latest/user/quickstart/#cookies
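As a rough sketch of what sending cookies means at the HTTP level, the snippet below builds a request with a Cookie header using only the standard library and never sends it. The cookie names and values are placeholders, not the names Stack Overflow actually uses, and `build_request` is a hypothetical helper. With Requests, the equivalent is passing the same dictionary through the `cookies` argument, as the linked quickstart describes.

```python
import urllib.request


def build_request(url, cookies):
    """Return a Request carrying the given cookies in a single Cookie header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(url, headers={"Cookie": cookie_header})


# Placeholder values: copy the real names and values from your logged-in browser session.
browser_cookies = {"session_id": "abc123", "user_token": "xyz789"}

request = build_request("https://stackoverflow.com/users/2683104/roberto", browser_cookies)
print(request.get_header("Cookie"))

# With Requests, the same dictionary is passed directly:
#   requests.get(url, cookies=browser_cookies)
```

Session cookies expire, so a script that relies on copied values will eventually need fresh ones.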

How to scrape a website with table content that is retrieved by javascript?

I want to scrape a table from a website. The table looks like this:
<table class="table table-hover data-table sort display">
<thead>
<tr>
<th class="Column1">
</th>
<th class="Column2">
</th>
</tr>
</thead>
<tbody>
<tr ng-repeat="item in filteredList | orderBy:columnToOrder:reverse">
<td>{{item.Col1}}</td>
<td>{{item.Col2}}</td>
</tr>
</tbody>
</table>
It seems that this website is built using some javascript framework that retrieves the table content from the backend through web services.
The problem is: how can we scrape the table data when the HTML contains no actual values? The code above only has placeholders enclosed in {{ }}. Does this make the website unscrapable? Is there a solution? Thank you.
I am using python and beautifulsoup4.
Usually when there is JS content BeautifulSoup is not the tool. I use selenium. Try this and see if the HTML you are getting is scrapable:
import time
from selenium import webdriver
driver = webdriver.Firefox()
driver.get(url)  # url is the address of the page you want to scrape
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
# Scroll to the bottom so content that loads on scroll is requested.
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5) # wait to load
# now print the response
print(driver.page_source)
At this point, you can use BeautifulSoup to scrape the data out of driver.page_source. Note: you will need to install selenium and Firefox.
You could try using import.io (https://import.io) - our connectors, extractors and crawlers all support getting data from pages that is rendered with JavaScript. Without a specific URL I can't verify yours will work for certain, but I don't see why it wouldn't (looks like it is being rendered by AngularJS which should be fine).
p.s. if you hadn't figured it out, I work at import.io - drop me a line if you have specific questions.
What you could do is open the site in Chrome, open the developer tools, and go to the 'Network' tab. Tick 'Preserve log' at the top, then reload the site so that every request shows up in the log. You will now see where the data for 'filteredList' on your page comes from, so your scraper can request that source directly. The data is most likely in JSON format, which can be picked up and processed to your heart's content.
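Once the request that feeds the table has been found in the Network tab, turning its response into rows is usually short. This is only a sketch: the payload below is invented, and it assumes the endpoint returns a JSON list of objects whose keys match the template fields Col1 and Col2. `rows_from_payload` is a hypothetical helper name.

```python
import json


def rows_from_payload(payload_text):
    """Turn a JSON list of objects into (Col1, Col2) tuples, one per table row."""
    items = json.loads(payload_text)
    return [(item["Col1"], item["Col2"]) for item in items]


# Invented payload in the assumed shape; a real one would come from the
# response body of the request found in the Network tab.
sample_payload = '[{"Col1": "alpha", "Col2": "1"}, {"Col1": "beta", "Col2": "2"}]'

table_rows = rows_from_payload(sample_payload)
for col1, col2 in table_rows:
    print(col1, col2)
```

If the real response nests the list under another key, index into that key before building the rows.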

Printing certain HTML Python Mechanize

I'm making a small Python script to log on to a website automatically, but I'm stuck.
I want to print a small part of the HTML to the terminal. It is located within this tag in the HTML file on the site:
<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>
But how do I extract and print just the name, John Appleseed?
I'm using Python's Mechanize on a Mac, by the way.
Mechanize is only good for fetching the html. Once you want to extract information from the html, you could use for example BeautifulSoup. (See also my answer to a similar question: Web mining or scraping or crawling? What tool/library should I use?)
Depending on where the <td> is located in the html (it's unclear from your question), you could use the following code:
html = ... # this is the html you've fetched
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
# use this (gets all <td> elements)
cols = soup.findAll('td')
# or this (gets only <td> elements with class='h3')
cols = soup.findAll('td', attrs={"class" : 'h3'})
print cols[0].renderContents() # print content of first <td> element
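For comparison, the same extraction can be sketched with only the standard library's html.parser, which may be handy when no third-party parser is installed. The class name `CellText` is made up for this example, and the snippet is the one from the question.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text found inside <td> elements with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; unquoted values are handled too.
        if tag == "td" and dict(attrs).get("class") == self.css_class:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


SNIPPET = """<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>"""

parser = CellText("h3")
parser.feed(SNIPPET)
name = "".join(parser.chunks).strip()
print(name)
```

This simple flag does not handle nested tables; a full parser such as BeautifulSoup or lxml is the more robust choice for real pages.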
As you have not provided the full HTML of the page, the only options right now are string.find() or regular expressions.
But the standard way of finding this is XPath. See this question: How to use Xpath in Python?
You can obtain the XPath for an element using the "Inspect Element" feature of Firefox.
For example, to find the XPath for the username on the Stack Overflow site:
Open Firefox, log in to the website, right-click the username (shadyabhi in my case) and select Inspect Element.
Hover over the tag, or right-click it, and choose "Copy XPath".
You can use a parser to extract any information from a document. I suggest using the lxml module.
Here you have an example:
from lxml import etree
from StringIO import StringIO
parser = etree.HTMLParser()
tree = etree.parse(StringIO("""<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>"""),parser)
>>> tree.xpath("string()").strip()
u'John Appleseed'
More information about lxml here
