Web Scraping with Python Requests/lxml: Getting data from ul/li

So I'm pretty new to this, and I haven't been able to find anything on Google about this question.
I'm using requests and lxml with Python. I've seen that there are a lot of different modules for web scraping, but is there any reason to choose one over another? Can you do the same stuff with requests/lxml as you can with, for example, BeautifulSoup?
Anyway, here's my actual question.
This is my code:
import requests
from lxml import html

# Login data
inputUrl = 'http://forum.mytestsite.com/login'
usr = 'myusername'
pwd = 'mypassword'
payload = dict(login=usr, password=pwd)

# Open session
with requests.Session() as s:
    # Login
    s.post(inputUrl, data=payload)
    # Get page data
    pageResult = s.get('http://forum.mytestsite.com/icons/', allow_redirects=False)
    pageResult = html.fromstring(pageResult.content)
    pageIcons = pageResult.xpath('//script[@id="table-icons"]/text()')
    print(pageIcons[0])
The result when printing pageIcons[0]:
<ul id="icons">
{{#each icons}}
<li data-handle="{{handle}}">
<img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
</li>
{{/each}}
</ul>
This is the website/js code that generates the icons:
<script id="table-icons" type="text/x-handlebars-template">
<ul id="icons">
{{#each icons}}
<li data-handle="{{handle}}">
<img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
</li>
{{/each}}
</ul>
</script>
And here's the result on the page:
<ul id="icons">
<li data-handle="558FSTBI" class="">
<img src="http://testsite.com/icons/558FSTBI.1.png" alt="Icon 1" title="Icon 1">
</li>
<li data-handle="310AYTZI">
<img src="http://testsite.com/icons/310AYTZI.1.png" alt="Icon 2" title="Icon 2">
</li>
<li data-handle="669PQXBI" class="">
<img src="http://testsite.com/icons/669PQXBI.1.png" alt="Icon 3" title="Icon 3">
</li>
</ul>
My goal:
What I would like to do is retrieve all of the li data-handles, but I haven't been able to figure out how to retrieve this data. So my goal is to retrieve all of the icon paths and their titles. Could anyone help me out here? I'd really appreciate any help :)

You aren't parsing the li or ul.
Start with this:
//ul[@id='icons']/li/img
And from those elements, you can extract the individual information
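For example, a minimal sketch of that extraction, assuming pageResult already holds the rendered HTML (as noted below, plain requests only sees the Handlebars template):
icons = pageResult.xpath("//ul[@id='icons']/li/img")
for img in icons:
    handle = img.getparent().get('data-handle')  # from the parent <li>
    path = img.get('src')                        # icon path
    title = img.get('title')                     # icon title
    print(handle, path, title)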
Regarding the first question: BeautifulSoup can optionally use lxml as its underlying parser. If you don't think you need it, and you are comfortable with XPath, don't worry about it.
However, since JavaScript generates the page, you need a headless browser rather than the requests library. See:
Get page generated with Javascript in Python
Reading dynamically generated web pages using python
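As a rough illustration, a minimal sketch combining a selenium-driven browser with the XPath above (the selenium setup is an assumption, not part of the original answer, and the login step from the question is omitted for brevity):
from selenium import webdriver
from lxml import html

# Render the page in a real browser so the Handlebars template executes
driver = webdriver.Firefox()
driver.get('http://forum.mytestsite.com/icons/')
tree = html.fromstring(driver.page_source)
driver.quit()

for img in tree.xpath("//ul[@id='icons']/li/img"):
    print(img.get('src'), img.get('title'))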

Related

Tag <li> not showing when using BeautifulSoup in Python

I was learning web scraping, and the li tag is not showing when I run soup.findAll.
Here's the html:
<label>
<input type="checkbox">
<ul class="dropdown-content">
<li>
<a href=stuff</a>
</li>
</ul>
</label>
I tried:
soup = BeautifulSoup(r.content,'html5lib')
dropdown = soup.findAll('ul', {'class':'dropdown-content'})
print(dropdown)
And it only shows:
[<ul class="dropdown-content"></ul>]
Any help will do. Thanks!
In this command: dropdown = soup.findAll('ul', {'class':'dropdown-content'}), you search for ul elements with the dropdown-content class. To get the list items, try:
dropdown = soup.find('ul').findAll('li')
Your selection per se is okay to find the <ul>; it just may not contain any <li>, because I assume these elements are generated dynamically by JavaScript. To validate this, the question should be improved and the URL of the website provided.
If the content is provided dynamically, one approach could be to work with selenium, which will render the website like a browser and can return the "full" DOM.
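A minimal sketch of that approach, assuming selenium with a matching chromedriver is installed; the URL is hypothetical, since none was provided:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # requires a matching chromedriver on PATH
driver.get('https://example.com/page-with-dropdown')  # hypothetical URL
soup = BeautifulSoup(driver.page_source, 'html5lib')  # DOM after JS has run
driver.quit()

for li in soup.find('ul', {'class': 'dropdown-content'}).find_all('li'):
    print(li)
If the <li> elements show up here but not with plain requests, that confirms they are generated by JavaScript.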
Note: In new code, use find_all() instead of the old syntax findAll().
Example
The HTML in your example is broken, but your code works if there are any <li> elements inside the <ul> in your soup.
from bs4 import BeautifulSoup
html = '''
<label>
<input type="checkbox">
<ul class="dropdown-content">
<li>
</li>
</ul>
</label>
'''
soup = BeautifulSoup(html,'html5lib')
dropdown = soup.find_all('ul', {'class':'dropdown-content'})
print(dropdown)
Output
[<ul class="dropdown-content">
<li>
</li>
</ul>]

How to follow pagination with scrapy

I have this target page:
<nav>
<ul class="pagination pagination-lg">
<li class="active" itemprop="pageStart">
1</li>
<li itemprop="pageEnd">
2</li>
<li>
<a href="moto-2.html" aria-label="Next" class="xh-highlight">
<span aria-hidden="true">»</span></a>
</li>
</ul>
</nav>
but I can't select the next-page link. I tried:
next_page_url = response.xpath('./div/div/div[1]/nav/ul/li[3]/a').extract_first()
also with
response.css('[class="xh-highlight"]').extract()
I only get [] as a result in the shell.
Another point: I set the user agent to Google Chrome because I read here about another user having problems with accent marks, but that didn't fix my problem.
I want to warn you that Scrapy cannot scrape websites rendered with JavaScript. Consider using a web driver like Selenium with Scrapy if the page is rendered in JavaScript.
I would recommend you go to the scrapy shell and type view(response). If you see a blank page, then the page is rendered in JavaScript.
This is how you get URLs from XPath, though I doubt it will make a difference since you see no object:
next_page_url = response.xpath('//nav/ul/li[3]/a/@href').extract_first()
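If the link does turn out to be in the downloaded HTML, here is a minimal pagination sketch in a Scrapy spider; the spider name and start URL are made up:
import scrapy

class MotoSpider(scrapy.Spider):
    name = 'moto'
    start_urls = ['http://example.com/moto-1.html']  # hypothetical start URL

    def parse(self, response):
        # ... extract items from the current page here ...
        # take the href of the "next" link, not its text
        next_page = response.css('a.xh-highlight::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
response.follow resolves the relative moto-2.html against the current page URL, so there is no need to join URLs by hand.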

How to scrape tags that appear within a script

My intention is to scrape the names of the top-selling products on Ali-Express.
I'm using the Requests library alongside Beautiful Soup to accomplish this.
# Imports implied by the aliases below
import requests as req
import pprint as pp
from bs4 import BeautifulSoup as bs

url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
# pp.pprint(soup)  # verify that the page has been fetched
all_items = soup.find_all('li', class_='top10-item')
pp.pprint(all_items)
# []
However, this returns an empty list, indicating that soup.find_all() did not find any tags fitting those criteria.
Inspect Element in Chrome displays the list items as expected (screenshot omitted).
However, in the source code the <ul class="top10-items"> is empty, and next to it is a script which seems to iterate through each list item (I'm not familiar with HTML).
<div class="container">
<div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<div class="rank-orders">
<span class="rank">{{rank}}</span>
<span class="orders">{{productOrderNum}}</span>
</div>
<div class="img-wrap">
<a href="{{productDetailUrl}}" target="_blank">
<img src="{{productImgUrl}}" alt="{{productName}}">
</a>
</div>
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
<p class="item-price">
<span class="price">US ${{productMinPrice}}</span>
<span class="uint">/ {{productUnitType}}</span>
</p>
</li>
{{/topList}}</script>
</div>
</div>
So this probably explains why soup.find_all() doesn't find the "li" tag.
My question is: How can I extract the item names from the script using Beautiful soup?
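A minimal sketch of getting at the template markup with BeautifulSoup; note that this can only yield the {{...}} placeholders such as {{productName}}, not the live product names, because those are filled in by JavaScript at runtime:
from bs4 import BeautifulSoup

# A trimmed-down stand-in for the page source shown above
html = '''
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<a class="item-desc" href="{{productDetailUrl}}">{{productName}}</a>
</li>
{{/topList}}
</script>
'''

soup = BeautifulSoup(html, 'html.parser')
# .string returns the raw template text inside the <script> tag
template = soup.find('script', class_='X-template-top10').string
# Parse that text as HTML in a second pass to reach the tags inside it
inner = BeautifulSoup(template, 'html.parser')
for a in inner.find_all('a', class_='item-desc'):
    print(a.get_text())  # prints the {{productName}} placeholder
To get the real names you would need whatever JSON endpoint the page's JavaScript calls, or a browser-based tool such as selenium.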

Input to a web page and retrieve the output

I am trying to enter a password into the "howsecureismypassword" website and retrieve the output using web scraping. Is this possible? Here is the HTML from the web page, and below is the code I have so far; any help would be appreciated.
<div role="main">
<div class="input">
<input type="password" id="password" ng-model="password" ng-change="passwordChange()" placeholder="Enter Password" class="password" autofocus>
</div>
<div class="phishing" ng-hide="password">
<p>This site could be stealing your password... it's not, but it easily <em>could</em> be.<br />Be careful where you type your password.</p>
</div>
<div ng-show="password">
<ul ng-show="display.config">
<li>
<label><input type="checkbox" ng-model="config.namedNumbers" ng-change="config.changeNamedNumbers()" />Use Named Numbers</label>
</li>
<li>
<label>Calculations per second</label>
<input type="text" ng-model="config.calculations" ng-change="config.changeCalculations()" />
</li>
</ul>
<div class="results">
<span class="toggle" ng-click="display.toggleConfig()">{{display.configText}}</span>
<p ng-hide="insecure">It would take <span ng-show="config.calculationsOriginal">a desktop PC</span> about <span class="main">{{time}}</span> to crack your password</p>
<a class="tweet-me" ng-hide="insecure" href="http://twitter.com/home/?status=It would take a desktop PC about {{time}} to crack my password!%0d%0dhttp://hsim.pw">[Tweet Result]</a>
<p ng-show="insecure">Your password would be cracked almost <span class="main">Instantly</span></p>
<a class="tweet-me" ng-show="insecure" href="http://twitter.com/home/?status=My password would be cracked almost instantly!%0d%0dhttp://hsim.pw">[Tweet Result]</a>
<span class="toggle" ng-click="display.toggleDetails()">{{display.detailsText}}</span>
</div>
# Program to access a web page using httplib2
from httplib2 import Http
from urllib.parse import urlencode

# Create a web object
h = Http()
# Set the url of the web page
url = 'https://howsecureismypassword.net/'
password = input('enter')
# Create a data dictionary
data = {'placeholder': password}
# Encode the dictionary
web_data = urlencode(data)
# POST the encoded form data to the web page
response, content = h.request(url, 'POST', web_data)
if response.status == 200:
    # Display the contents of the web page returned
    text_content = content.decode()
    print('Contents:')
    print(text_content)
    # Turn it into "Beautiful Soup"
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(content, 'html.parser')
    print(soup.get_text())
else:
    print('Error accessing web page')
I would use splinter for this. BeautifulSoup is good for scraping, not for interacting with pages. Here is the example straight from the splinter docs:
from splinter import Browser

with Browser() as browser:
    # Visit URL
    url = "http://www.google.com"
    browser.visit(url)
    browser.fill('q', 'splinter - python acceptance testing for web applications')
    # Find and click the 'search' button
    button = browser.find_by_name('btnG')
    # Interact with elements
    button.click()
    if browser.is_text_present('splinter.cobrateam.info'):
        print("Yes, the official website was found!")
    else:
        print("No, it wasn't found... We need to improve our SEO techniques")
You can adapt something like that to suit your needs.
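For instance, a hedged sketch adapted to this page; the selectors are guesses based on the HTML you posted, and the password is a dummy value:
from splinter import Browser

with Browser() as browser:
    browser.visit('https://howsecureismypassword.net/')
    # Fill the <input id="password"> shown in the question's HTML
    browser.find_by_id('password').first.fill('correct horse battery staple')
    # Read back the Angular-rendered result, e.g. "3 days"
    result = browser.find_by_css('.results .main').first.text
    print('Estimated time to crack:', result)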

Python data scraping

I want to download a couple songs off of http://www.youtube-mp3.org/. I'm using urllib2 and BeautifulSoup.
The problem is that when I open the site with urllib2 with my video ID plugged in, http://www.youtube-mp3.org/?c#v=lV7r8PiuecQ, I get the site, but they are tricky about it and load the info after the initial page load with some JS/AJAX stuff. So when I try to scrape the URL of the download link, it literally isn't on the page, because it hasn't been loaded yet.
Does anyone know how I can trigger this JS loader from my Python script, or something like that?
Here is the relevant empty HTML before the content that I want is loaded into it:
<div id="link_box" style="display:none">
<div id="link_box_title" style="font-weight:bold; text-decoration:underline">
</div>
<div class="row">
<div id="link_box_bb_code_title" style="font-weight:bold">
</div>
<input type="text" id="BBCodeLink" onclick="sAll(this)" />
</div>
<div class="row">
<div id="link_box_html_code_title" style="font-weight:bold">
</div>
<input type="text" id="HTMLLink" onclick="sAll(this)" />
</div>
<div class="row">
<div id="link_box_direct_code_title" style="font-weight:bold">
</div>
<input type="text" id="DirectLink" onclick="sAll(this)" />
</div>
</div>
<div id="v-ads">
</div>
<div id="dl_link">
</div>
<div id="progress">
</div>
<div id="loader">
<img src="ajax-loader-b.gif" alt="loading.." width="16" height="11" />
</div>
</div>
<div class="clear">
</div>
</div>
The API is JSON-based, so the contents of the html files won't give you any clue on where to find the files. A good idea when exploring web services like this one is to open the Network tab in Chrome's developer tools and see what pages it loads when interacting with the page. That exercise showed me that two urls in particular seem interesting:
http://www.youtube-mp3.org/api/pushItem/?item=http%3A//www.youtube.com/watch%3Fv%3DKMU0tzLwhbE&xy=trve&r=1314700829128
http://www.youtube-mp3.org/api/itemInfo/?video_id=KMU0tzLwhbE&adloc=&r=1314700829314
The first url appears to be queuing a file for processing, the second to get the status of the processing job.
The second url takes a video_id GET parameter that is the id for the video on youtube (http://www.youtube.com/watch?v=KMU0tzLwhbE) and returns the status of the decoding job. The second and third GET parameters seem irrelevant for this purpose, which you can verify by test loading the url with and without the extra parameters.
The content of the page is:
info = { "title" : "Developers",
"image" : "http://i4.ytimg.com/vi/KMU0tzLwhbE/default.jpg",
"length" : "3", "status" : "serving", "progress_speed" : "",
"progress" : "", "ads" : "",
"h" : "a0aa17294103c638fa7f5e0606f839d3" };
Which happens to be JSON data. The interesting bit in this is "a0aa17294103c638fa7f5e0606f839d3", which looks like a hash that the web service uses to refer to the decoded mp3 file. Also check out how the download link on the front page looks:
http://www.youtube-mp3.org/get?video_id=KMU0tzLwhbE&h=a0aa17294103c638fa7f5e0606f839d3
Now we have all the missing pieces of the puzzle. First, we take the url of a youtube video (http://www.youtube.com/watch?v=iKP7DZmqdbU), URL-quote it, and feed it to the api using this url:
http://www.youtube-mp3.org/api/pushItem/?item=http%3A//www.youtube.com/watch%3Fv%3DiKP7DZmqdbU&xy=trve
Then, wait a few moments until the decoding job is done:
http://www.youtube-mp3.org/api/itemInfo/?video_id=iKP7DZmqdbU
Take the hash found in the info url to construct the download url:
http://www.youtube-mp3.org/get?video_id=iKP7DZmqdbU&h=2e4b61b6ddc8bf83f5a0e4e4ee0635bb
Note that it is possible that the webmaster of the site does not want it to be scraped and will take countermeasures if people start to (in the webmaster's eyes) abuse the site. For example, it seems to use referer protection, so clicking the links in this post won't work; you have to copy them and load them in a new browser window.
Test code:
from re import findall
from time import sleep
from urllib.request import urlopen
from urllib.parse import quote

yt_code = 'gijypDkEqUA'
yt_url = 'http://www.youtube.com/watch?v=%s' % yt_code
push_url_fmt = 'http://www.youtube-mp3.org/api/pushItem/?item=%s&xy=trve'
info_url_fmt = 'http://www.youtube-mp3.org/api/itemInfo/?video_id=%s'
download_url_fmt = 'http://www.youtube-mp3.org/get?video_id=%s&h=%s'

# Queue the video for decoding
push_url = push_url_fmt % quote(yt_url)
data = urlopen(push_url).read()

# Give the decoding job some time to finish
sleep(10)

# Fetch the job info and pull out the "h" hash
info_url = info_url_fmt % yt_code
data = urlopen(info_url).read().decode()
res = findall('"h" : "([^"]*)"', data)

# Construct the final download url from the video id and the hash
download_url = download_url_fmt % (yt_code, res[0])
print('Download here:', download_url)
You could use selenium to interact with the JS stuff and then combine it with BeautifulSoup, or do everything with selenium, just as you prefer.
http://seleniumhq.org/
Selenium is a tool for browser automation and has bindings for a few languages, including Python. It takes a running instance of Firefox/IE/Chrome and lets you script it (I suggest using the selenium webdriver for this simple problem, not the whole selenium server).
You're going to have to work through http://www.youtube-mp3.org/client.js and figure out the exact information that is being passed around. This could allow you to post a request, parse the response, and download from the correct scraped URL.
