I have been learning Scrapy for the past couple of days, and I am having trouble getting all the list elements on a page.
The page has a structure similar to this:
<ol class="list-results">
  <li class="SomeClass i">
    <ul>
      <li class="name">Name1</li>
    </ul>
  </li>
  <li class="SomeClass 0">
    <ul>
      <li class="name">Name2</li>
    </ul>
  </li>
  <li class="SomeClass i">
    <ul>
      <li class="name">Name3</li>
    </ul>
  </li>
</ol>
In the parse function of my Scrapy spider, I get all the list elements like this:
def parse(self, response):
    sel = Selector(response)
    all_elements = sel.css('.SomeClass')
    print(len(all_elements))
I know that there are about 300 list elements with that class on the test page I request; however, printing len(all_elements) gives me only 61.
I have tried using xpaths like:
sel.xpath("//*[contains(concat(' ', @class, ' '), ' SomeClass ')]")
And still I get 61 elements instead of the roughly 300 I should.
I am also using a try/except clause in case an element raises an exception.
Here is the actual page I would be scraping:
https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter=
Please understand, I am doing this for practice only!
Please help! Thank you! I just don't know what else to do!
I am afraid you are dealing with non-well-formed, broken HTML, which Scrapy (and the underlying lxml) cannot parse reliably. For instance, see this unclosed div inside the li tag:
<li class="unit"><span>Unit:</span>
<div class="unit-block"> Language Program
</li>
I'd switch to parsing the HTML with BeautifulSoup. In other words, continue to use all the other parts and components of the Scrapy framework, but leave the HTML parsing to BeautifulSoup.
Demo from the scrapy shell:
$ scrapy shell "https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter="
In [1]: len(response.css('li.student'))
Out[1]: 55
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(response.body)
In [4]: len(soup.select('li.student'))
Out[4]: 281
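Putting it together in a spider, a minimal parse() sketch that delegates parsing to BeautifulSoup might look like this (li.student matches the actual page from the shell demo above; the li.name selector mirrors the question's markup and is an assumption):
import scrapy
from bs4 import BeautifulSoup

class PeopleSpider(scrapy.Spider):
    name = 'people'
    start_urls = ['https://search.msu.edu/people/index.php?fst=ab&lst=&nid=&filter=']

    def parse(self, response):
        # Hand the broken markup to BeautifulSoup instead of Scrapy's selectors
        soup = BeautifulSoup(response.body, 'html.parser')
        for item in soup.select('li.student'):
            name = item.select_one('li.name')  # assumed structure from the question
            yield {'name': name.get_text(strip=True) if name else None}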
If you are using a CrawlSpider and need a LinkExtractor based on BeautifulSoup, see:
A scrapy link extractor that uses BeautifulSoup
Related
I am trying to scrape a webpage which uses the following: <li id="size_name_1" ....> <li id="size_name_2" ....> <li id="size_name_a" ....>. Is there a way to find size_name_NUMBER, such as:
response.xpath('//*[@id="size_name_\d+"]')
I want to use a regex in the id search. Note that I use Scrapy.
You could do this with CSS selectors instead by using a regex to grab the appropriate ids first. I note you are using Scrapy, but the same principle should apply.
from bs4 import BeautifulSoup
import re

html = '''
<html>
<head></head>
<body>
<li id="size_name_1" > me </li>
<li id="size_name_2" > and me </li>
<li id="size_name_a" > but not me :-(</li>
</body>
</html>
'''
# Grab only the ids whose suffix is numeric
p = re.compile(r'id="(size_name_\d+)"')
ids = p.findall(html)
soup = BeautifulSoup(html, 'lxml')
for i in ids:
    print(soup.select_one(f'li[id="{i}"]'))
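Since the question mentions Scrapy, note that Scrapy selectors also support the EXSLT re:test() function out of the box, so the regex can live inside the XPath itself. A quick sketch, assuming a Scrapy response object:
# Scrapy registers the EXSLT 're' namespace by default
for li in response.xpath(r'//li[re:test(@id, "^size_name_\d+$")]'):
    print(li.get())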
I have this HTML at my target URL:
<nav>
<ul class="pagination pagination-lg">
<li class="active" itemprop="pageStart">
1</li>
<li itemprop="pageEnd">
2</li>
<li>
<a href="moto-2.html" aria-label="Next" class="xh-highlight">
<span aria-hidden="true">ยป</span></a>
</li>
</ul>
</nav>
but I can't select the next-page link. I tried:
next_page_url = response.xpath('./div/div/div[1]/nav/ul/li[3]/a').extract_first()
also with
response.css('[class="xh-highlight"]').extract()
I only get [] as the result in the shell.
One other point: I set the user agent to Google Chrome because I read here about another user having problems with accented characters, but that didn't fix my problem.
I want to warn you that Scrapy cannot scrape websites rendered with JavaScript. Consider using a web driver like Selenium with Scrapy if the page is rendered in JavaScript.
I would recommend you go to the Scrapy shell and type view(response). If you see a blank page, then the page is rendered in JavaScript.
This is how you get URLs with XPath, but I doubt it will make a difference since you see no object:
next_page_url = response.xpath('//nav/ul/li[3]/a/@href').extract_first()
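If the element does turn out to be present in the raw HTML, a CSS alternative that pulls the href attribute directly would be:
# ::attr(href) extracts the attribute value instead of the whole element
next_page_url = response.css('a.xh-highlight::attr(href)').extract_first()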
Guys, I have a question about Scrapy selectors and XPath.
I would like to select the link in the a tag inside the last li tag in the HTML below. How do I write the XPath query for that?
I did it like this, but I believe there is a simpler way, such as doing it all in the XPath query instead of indexing a Python list; I just don't know how to write it:
from scrapy import Selector
sel = Selector(text=html)
print(sel.xpath('//ul/li').xpath('a/@href').extract()[-1])
where html is:
<ul>
<li>
<a href="/info/page/" rel="follow">
<span class="page-numbers">
35
</span>
</a>
</li>
<li>
<a href="/info/page/" rel="follow">
<span class="next">
next page.
</span>
</a>
</li>
</ul>
I am assuming you want specifically the link to the "next" page. If that is the case, you can locate an a element by checking that its child span has the "next" class:
//a[span/@class = "next"]/@href
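If instead you literally want the a inside the last li, XPath's last() function avoids the Python-side list indexing entirely; a quick sketch with a Scrapy Selector:
from scrapy import Selector
sel = Selector(text=html)
# (//ul/li)[last()] picks the final li of the whole match set
print(sel.xpath('(//ul/li)[last()]/a/@href').extract_first())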
So I'm pretty new to this, and I haven't been able to find anything on Google about this question.
I'm using requests and lxml with Python. I've seen that there are a lot of different modules for web scraping, but is there any reason to choose one over another? Can you do the same things with requests/lxml as you can with, for example, BeautifulSoup?
Anyway, here's my actual question.
This is my code:
import requests
from lxml import html

# Login data
inputUrl = 'http://forum.mytestsite.com/login'
usr = 'myusername'
pwd = 'mypassword'
payload = dict(login=usr, password=pwd)

# Open session
with requests.Session() as s:
    # Login
    s.post(inputUrl, data=payload)
    # Get page data
    pageResult = s.get('http://forum.mytestsite.com/icons/', allow_redirects=False)
    pageResult = html.fromstring(pageResult.content)
    pageIcons = pageResult.xpath('//script[@id="table-icons"]/text()')
    print(pageIcons[0])
The result when printing pageIcons[0]:
<ul id="icons">
{{#each icons}}
<li data-handle="{{handle}}">
<img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
</li>
{{/each}}
</ul>
This is the website's JS code that generates the icons:
<script id="table-icons" type="text/x-handlebars-template">
<ul id="icons">
{{#each icons}}
<li data-handle="{{handle}}">
<img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
</li>
{{/each}}
</ul>
</script>
And here's the result on the page:
<ul id="icons">
<li data-handle="558FSTBI" class="">
<img src="http://testsite.com/icons/558FSTBI.1.png" alt="Icon 1" title="Icon 1">
</li>
<li data-handle="310AYTZI">
<img src="http://testsite.com/icons/310AYTZI.1.png" alt="Icon 2" title="Icon 2">
</li>
<li data-handle="669PQXBI" class="">
<img src="http://testsite.com/icons/669PQXBI.1.png" alt="Icon 3" title="Icon 3">
</li>
</ul>
My goal:
What I would like to do is retrieve all of the li data-handles, but I haven't been able to figure out how. So my goal is to retrieve all of the icon paths and their titles. Could anyone help me out here? I'd really appreciate any help :)
You aren't parsing the li or ul.
Start with this
//ul[@id='icons']/li/img
And from those elements, you can extract the individual pieces of information.
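For example, once you have fully rendered HTML (see the JavaScript caveat below), pulling the handles, image paths, and titles out of the lxml tree could look like this (pageResult is the parsed tree from the question's code):
for img in pageResult.xpath("//ul[@id='icons']/li/img"):
    handle = img.getparent().get('data-handle')  # data-handle sits on the parent li
    print(handle, img.get('src'), img.get('title'))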
Regarding the first question: BeautifulSoup can optionally use lxml as its parser. If you don't think you need it and are comfortable with XPath, don't worry about it.
However, since it's JavaScript generating the page, you need a headless browser rather than the requests library. See:
Get page generated with Javascript in Python
Reading dynamically generated web pages using python
I'm using Scrapy to get some data from a website.
I have the following list of links:
<li class="m-pagination__item">
10
</li>
<li class="m-pagination__item">
<a href="?isin=IT0000072618&lang=it&page=1">
<span class="m-icon -pagination-right"></span>
</a>
I want to extract the href attribute of only the a element that contains the span with class="m-icon -pagination-right".
I've been looking for XPath examples, but I'm not an expert in XPath and I couldn't find a solution.
Thanks.
//a[span/@class = 'm-icon -pagination-right']/@href
With a Scrapy response:
response.css('span.m-icon').xpath('../#href')
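As a usage sketch inside a spider callback (response.follow is standard Scrapy API; the callback name here is an assumption):
next_page = response.css('span.m-icon').xpath('../@href').get()
if next_page:
    # Queue the next results page with the same parsing logic
    yield response.follow(next_page, callback=self.parse)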