How to follow pagination with scrapy - python

I have this target url:
<nav>
<ul class="pagination pagination-lg">
<li class="active" itemprop="pageStart">
1</li>
<li itemprop="pageEnd">
2</li>
<li>
<a href="moto-2.html" aria-label="Next" class="xh-highlight">
<span aria-hidden="true">»</span></a>
</li>
</ul>
</nav>
but I can't select the next-page link. I tried with:
next_page_url = response.xpath('./div/div/div[1]/nav/ul/li[3]/a').extract_first()
also with
response.css('[class="xh-highlight"]').extract()
I only get [] as the result in the shell.
One other point: I set the user agent to Google Chrome because I read here about another user who had problems with accent marks, but that didn't fix my problem.

I want to warn you that Scrapy cannot scrape websites rendered with JavaScript. Consider using a web driver like Selenium with Scrapy if the page is rendered in JavaScript.
I would recommend you go to the scrapy shell and type view(response). If you see a blank page, then the page is rendered in JavaScript.
This is how you get the URL with XPath, but I doubt it will make a difference since you see no object:
next_page_url = response.xpath('//nav/ul/li[3]/a/@href').extract_first()
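For following the pagination itself, a minimal sketch (assuming the page is plain HTML rather than JavaScript-rendered; the aria-label="Next" attribute comes from the markup in the question):
def parse(self, response):
    # ... extract items from the current page here ...
    next_page = response.css('ul.pagination a[aria-label="Next"]::attr(href)').extract_first()
    if next_page:
        # response.follow resolves relative URLs such as "moto-2.html"
        yield response.follow(next_page, callback=self.parse)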

Related

Tag <li> not showing when using BeautifulSoup in Python

I was learning web scraping, and the 'li' tag is not showing when I run soup.findAll
Here's the html:
<label>
<input type="checkbox">
<ul class="dropdown-content">
<li>
<a href=stuff</a>
</li>
</ul>
</label>
I tried:
soup = BeautifulSoup(r.content,'html5lib')
dropdown = soup.findAll('ul', {'class':'dropdown-content'})
print(dropdown)
And it only shows:
[<ul class="dropdown-content"></ul>]
Any help will do. Thanks!
In this command: dropdown = soup.findAll('ul', {'class':'dropdown-content'}), you search for a ul with the dropdown-content class. To get the li elements inside it, use:
dropdown = soup.find('ul').findAll('li')
Your selection per se is okay to find the <ul>; it just may not contain any <li>, because I assume those elements are generated dynamically by JavaScript. To validate this, the question should be improved and the URL of the website provided.
If the content is provided dynamically, one approach could be to work with Selenium, which renders the website like a browser and can return the "full" DOM.
Note: In new code, use find_all() instead of the old syntax findAll().
Example
The HTML in your example is broken, but your code works if there are any li elements in the ul in your soup.
import requests
from bs4 import BeautifulSoup
html = '''
<label>
<input type="checkbox">
<ul class="dropdown-content">
<li>
</li>
</ul>
</label>
'''
soup = BeautifulSoup(html,'html5lib')
dropdown = soup.find_all('ul', {'class':'dropdown-content'})
print(dropdown)
Output
[<ul class="dropdown-content">
<li>
</li>
</ul>]
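If the li elements really are injected by JavaScript, a minimal Selenium sketch could look like this (assuming chromedriver is available; the URL is a placeholder, since the question does not provide one):
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL for the page in question
soup = BeautifulSoup(driver.page_source, 'html5lib')  # page_source holds the rendered DOM
driver.quit()
dropdown = soup.find('ul', {'class': 'dropdown-content'})
print(dropdown.find_all('li') if dropdown else 'no dropdown found')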

How can I make my navigation links work from any page?

I have a Django template containing this menu:
<ul class="menu">
<li class="nav-current" role="presentation"><a href="index">index</a></li>
<li role="presentation"><a href="cpuinfo">cpu info</a></li>
<li role="presentation"><a href="about">about</a></li>
</ul>
When I click "cpu info" from the home page my browser goes to /cpuinfo. This works.
But when I am on other pages like /post/ that link takes me to /post/cpuinfo, which isn't correct.
How can I make my link work from any page?
You need the {% url %} tag in the template, for example:
<li role="presentation"><a href="{% url 'cpuinfo' %}">cpu info</a></li>
<!-- Change 'cpuinfo' to the real url name -->
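For {% url 'cpuinfo' %} to resolve, the URL pattern needs a matching name. A minimal sketch of urls.py, with the view and url name assumed for illustration:
from django.urls import path
from . import views  # assuming a views module with a cpuinfo view
urlpatterns = [
    path('cpuinfo', views.cpuinfo, name='cpuinfo'),  # {% url 'cpuinfo' %} resolves to /cpuinfo from any page
]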

scrapy xpath how to use?

Guys, I have a question about Scrapy's Selector and XPath.
I would like to select the link in the "a" tag inside the last "li" tag in the HTML below. How do I write the XPath query for that?
I did it like this, but I believe there is a simpler way, such as a single XPath query instead of list slicing; I just don't know how to write it:
from scrapy import Selector
sel = Selector(text=html)
print(sel.xpath('//ul/li').xpath('a/@href').extract()[-1])
'''
html
'''
<ul>
<li>
<a href="/info/page/" rel="follow">
<span class="page-numbers">
35
</span>
</a>
</li>
<li>
<a href="/info/page/" rel="follow">
<span class="next">
next page.
</span>
</a>
</li>
</ul>
I am assuming you want specifically the link to the "next" page. If this is the case, you can locate an a element by checking that its child span has the "next" class:
//a[span/@class = "next"]/@href
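A minimal sketch applying that XPath to the sample markup with a Scrapy Selector:
from scrapy import Selector
html = '''
<ul>
<li><a href="/info/page/" rel="follow"><span class="page-numbers">35</span></a></li>
<li><a href="/info/page/" rel="follow"><span class="next">next page.</span></a></li>
</ul>
'''
sel = Selector(text=html)
print(sel.xpath('//a[span/@class = "next"]/@href').extract_first())  # -> /info/page/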

Web Scraping with Python Request/lxml: Getting data from ul/li

So I'm pretty new to this, and I haven't been able to find anything on Google about this question.
I'm using requests and lxml with Python. I've seen that there are a lot of different modules for web scraping, but is there any reason to choose one over another? Can you do the same stuff with requests/lxml as you can with, for example, BeautifulSoup?
Anyway, here's my actual question;
This is my code:
import requests
from lxml import html
# Login data
inputUrl = 'http://forum.mytestsite.com/login'
usr = 'myusername'
pwd = 'mypassword'
payload = dict(login=usr, password=pwd)
# Open session
with requests.Session() as s:
    # Login
    s.post(inputUrl, data=payload)
    # Get page data
    pageResult = s.get('http://forum.mytestsite.com/icons/', allow_redirects=False)
    pageResult = html.fromstring(pageResult.content)
    pageIcons = pageResult.xpath('//script[@id="table-icons"]/text()')
    print(pageIcons[0])
The result when printing pageIcons[0]:
<ul id="icons">
{{#each icons}}
<li data-handle="{{handle}}">
<img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
</li>
{{/each}}
</ul>
This is the website/js code that generates the icons:
<script id="table-icons" type="text/x-handlebars-template">
<ul id="icons">
{{#each icons}}
<li data-handle="{{handle}}">
<img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
</li>
{{/each}}
</ul>
</script>
And here's the result on the page:
<ul id="icons">
<li data-handle="558FSTBI" class="">
<img src="http://testsite.com/icons/558FSTBI.1.png" alt="Icon 1" title="Icon 1">
</li>
<li data-handle="310AYTZI">
<img src="http://testsite.com/icons/310AYTZI.1.png" alt="Icon 2" title="Icon 2">
</li>
<li data-handle="669PQXBI" class="">
<img src="http://testsite.com/icons/669PQXBI.1.png" alt="Icon 3" title="Icon 3">
</li>
</ul>
My goal:
What I would like to do is retrieve all of the li data-handles, but I haven't been able to figure out how. So my goal is to retrieve all of the icon paths and their titles. Could anyone help me out here? I'd really appreciate any help :)
You aren't parsing the li or ul.
Start with this
//ul[@id='icons']/li/img
And from those elements, you can extract the individual information
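A minimal sketch against the rendered markup shown above, once you have the final DOM (e.g. from a headless browser):
from lxml import html
rendered = '''
<ul id="icons">
<li data-handle="558FSTBI"><img src="http://testsite.com/icons/558FSTBI.1.png" alt="Icon 1" title="Icon 1"></li>
</ul>
'''
tree = html.fromstring(rendered)
for img in tree.xpath("//ul[@id='icons']/li/img"):
    # the data-handle lives on the parent li; src and title on the img itself
    print(img.getparent().get('data-handle'), img.get('src'), img.get('title'))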
Regarding the first question: BeautifulSoup optionally uses lxml as its parser. If you don't think you need it and are comfortable with XPath, don't worry about it.
However, since it's JavaScript generating the page, you need a headless browser rather than the requests library.
Get page generated with Javascript in Python
Reading dynamically generated web pages using python

Spider not scraping the right amount of items

I have been learning Scrapy for the past couple of days, and I am having trouble with getting all the list elements on the page.
So the page has a similar structure like this:
<ol class="list-results">
<li class="SomeClass i">
<ul>
<li class="name">Name1</li>
</ul>
</li>
<li class="SomeClass 0">
<ul>
<li class="name">Name2</li>
</ul>
</li>
<li class="SomeClass i">
<ul>
<li class="name">Name3</li>
</ul>
</li>
</ol>
In the parse function of my Scrapy spider, I get all the list elements like this:
def parse(self, response):
    sel = Selector(response)
    all_elements = sel.css('.SomeClass')
    print(len(all_elements))
I know that there are about 300 list elements with that class on the test page that I request; however, after printing len(all_elements), I get only 61.
I have tried using xpaths like:
sel.xpath("//*[contains(concat(' ', @class, ' '), 'SomeClass')]")
And yet I am still getting 61 elements instead of the 300 that I should be.
Also, I am using a try/except clause in case one element throws an exception.
Here is the actual page I would be scraping:
https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter=
Please understand, I am doing this for practice only!
Please help! Thank you! I just don't know what else to do!
I am afraid you are dealing with non-well-formed, broken HTML, which Scrapy (and the underlying lxml) is not able to parse reliably. For instance, see this unclosed div inside the li tag:
<li class="unit"><span>Unit:</span>
<div class="unit-block"> Language Program
</li>
I'd switch to parsing the HTML with BeautifulSoup. In other words, continue to use all the other parts and components of the Scrapy framework, but leave the HTML-parsing part to BeautifulSoup.
Demo from the scrapy shell:
$ scrapy shell "https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter="
In [1]: len(response.css('li.student'))
Out[1]: 55
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(response.body)
In [4]: len(soup.select('li.student'))
Out[4]: 281
If you are using a CrawlSpider and need a LinkExtractor based on BeautifulSoup, see:
A scrapy link extractor that uses BeautifulSoup
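A minimal sketch of that combination (assuming html5lib is installed and that each student row contains an li.name, as in the question's sample markup):
import scrapy
from bs4 import BeautifulSoup

class PeopleSpider(scrapy.Spider):
    name = 'people'  # hypothetical spider name
    start_urls = ['https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter=']

    def parse(self, response):
        # hand the raw HTML to BeautifulSoup's lenient parser instead of Scrapy's selectors
        soup = BeautifulSoup(response.body, 'html5lib')
        for student in soup.select('li.student'):
            name = student.select_one('li.name')
            yield {'name': name.get_text(strip=True) if name else None}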
