How to get the highest page number using xpath? - python

I've written an xpath expression to get the highest page number from some html elements. However, with the below xpath I'm getting the last text, which is Next Page in this case. I'd like my xpath to act in such a way that I get the highest number instead, which is 6 here.
The elements upon which the xpath should be applied:
content = """
<div class="nav-links"><span aria-current="page" class="page-numbers current"><span class="meta-nav screen-reader-text">Page </span>1</span>
<a class="page-numbers" href="https://page/2/"><span class="meta-nav screen-reader-text">Page </span>2</a>
<span class="page-numbers dots">…</span>
<a class="page-numbers" href="https://page/6/"><span class="meta-nav screen-reader-text">Page </span>6</a>
<a class="next page-numbers" href="https://page/2/"><span class="screen-reader-text">Next Page</span></a></div>
"""
What I've tried so far:
from lxml.html import fromstring
root = fromstring(content)
pagenum = root.xpath("//*[contains(@class,'page-numbers')][last()]/span")[0].text
print(pagenum)
Output I'm having:
Next Page
Output I wish to have:
6

You can use the exact class name to avoid fetching the Next link:
//a[@class="page-numbers"][last()]
Note that contains(@class,'page-numbers') will return the links with numbers and Next, while @class="page-numbers" returns the numbered links only.
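For completeness, a minimal sketch of the corrected expression run against the content snippet from the question. Note that the number 6 is a bare text node sitting after the inner "Page " span, so text() is the easiest way to reach it:
from lxml.html import fromstring
root = fromstring(content)
# The "Next" link has class="next page-numbers", so an exact class
# match keeps only the numbered links; [last()] then picks the last one
last_page = root.xpath('//a[@class="page-numbers"][last()]/text()')[0]
print(last_page)  # 6
If the numbered links ever appear out of order, taking max(int(n) for n in root.xpath('//a[@class="page-numbers"]/text()')) is a more robust variant.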

Extracting text inside tags from html document

I have an html document like this: https://dropmefiles.com/wezmb
So I need to extract the text inside the tags <span id="1"> and </span>, but I don't know how.
This is what I'm trying:
from bs4 import BeautifulSoup
with open("10_01.htm") as fp:
    soup = BeautifulSoup(fp, features="html.parser")
for a in soup.find_all('span'):
    print(a.string)
But it extracts the text from all 'span' tags. So, how can I extract only the text inside <span id="1"> ... </span> in Python?
What you need is the .contents attribute (see the documentation).
Find the span <span id="1"> ... </span> using:
for x in soup.find(id="1").contents:
    print(x)
OR
x = soup.find(id="1").contents[0]  # there is only one element with id "1"
print(x)
This will give you:
10
that is, an empty line followed by 10, followed by another empty line. This is because the string in the HTML actually contains those newlines, as you can see in the HTML source where 10 sits on its own line.
The string will actually be '\n10\n'.
If you want just x = '10' from x = '\n10\n', you can do x = x[1:-1], since '\n' is a single character. Hope this helped.
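Putting it together, a self-contained sketch; the linked file isn't reproduced in the question, so the HTML below is a hypothetical stand-in:
from bs4 import BeautifulSoup

# Hypothetical stand-in for the linked document
html = '<span id="1">\n10\n</span><span id="2">\n20\n</span>'
soup = BeautifulSoup(html, features="html.parser")

x = soup.find(id="1").contents[0]  # the NavigableString '\n10\n'
print(x.strip())                   # '10' -- strip() is a safer alternative to x[1:-1]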

How to get the <h1> elements and then next sibling <h2> and <p> elements

I'm working in Python on a Selenium problem. I'm trying to gather each element with an h1 tag and, following that tag, get the closest h2 and paragraph text tags and place that data into an object.
My current code looks like:
cards = browser.find_elements_by_tag_name("h1")
ratings = browser.find_elements_by_tag_name('h3')
descriptions = browser.find_elements_by_tag_name('p')
print(len(cards))
print(len(ratings))
print(len(descriptions))
which is generating inconsistent numbers.
To get the <h1> tag elements and then the next sibling <h2> and <p> tag elements, you can use the following solution:
cards = browser.find_elements_by_tag_name("h1")
ratings = browser.find_elements_by_xpath("//h1/following-sibling::h2")
descriptions = browser.find_elements_by_xpath("//h1/following-sibling::p")
print(len(cards))
print(len(ratings))
print(len(descriptions))
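If the goal is to pair each <h1> with its own following <h2> and <p>, a sketch of one way to do it, assuming the h2 and p really are later siblings of each h1 as the XPath above implies, is to query relative to each h1 element:
cards = []
for h1 in browser.find_elements_by_tag_name("h1"):
    # [1] keeps only the nearest following sibling of each kind
    h2s = h1.find_elements_by_xpath("./following-sibling::h2[1]")
    ps = h1.find_elements_by_xpath("./following-sibling::p[1]")
    cards.append({
        "title": h1.text,
        "rating": h2s[0].text if h2s else None,
        "description": ps[0].text if ps else None,
    })
print(cards)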

multiple divs with scrapy

I am trying to scrape a website using scrapy, and the structure that I am trying to get data from is as below:
<div class="AA BB">
<span data-format-supply data-format-value="16">16</span>HAYA
</div>
<div class="AA BB">
<span data-format-supply data-format-value="21">42</span>
<span data-format-supply data-format-value="21">21</span>HAYA
</div>
I want to extract the text within the second span of the second div. For example, in this case, I want to extract 21. My code is as below:
def parse(self, response):
    sel = Selector(response)
    sel.css('div.coin-summary-item-detail').extract()
My question is: how do I select the second AA div using css? And after that, how do I specify that I want only the text inside the second span?
Any help will be greatly appreciated, thank you!
.extract() returns a list of strings, one per node matching the selector, so you can iterate and pick the one you need by index. Scrapy has a shortcut for the first element (.extract_first()) but not for the second. Also note that data-format-supply is an attribute on the span, not a class, so div.data-format-supply would match nothing.
Please try:
def parse(self, response):
    sel = Selector(response)
    results = sel.css('div.AA span:last-child::text').extract()
    for index, result in enumerate(results):
        if index == 1:
            print(result)
Try sel.css('div.AA + div.AA span:last-child::text').extract()
The + combinator tells the spider to choose the element that is the immediately following sibling of the first div whose class attribute has 'AA', i.e. the second div, which also has 'AA' in its class attribute.
span:last-child tells the spider to choose the last span element among that div's children. You may also write it as span:nth-child(2), which means to choose the second span child, since the 21 span is the div's second child.
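A runnable sketch of that selector, feeding the question's HTML straight into Scrapy's Selector (it accepts raw text, so no spider is needed to test it):
from scrapy.selector import Selector

html = """
<div class="AA BB">
<span data-format-supply data-format-value="16">16</span>HAYA
</div>
<div class="AA BB">
<span data-format-supply data-format-value="21">42</span>
<span data-format-supply data-format-value="21">21</span>HAYA
</div>
"""

sel = Selector(text=html)
# div.AA + div.AA jumps to the second div; span:last-child takes the
# last span inside it; ::text extracts its text node
print(sel.css('div.AA + div.AA span:last-child::text').extract())  # ['21']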

Using SeleniumDriver to pull data from multiple spans

So I have multiple webpages whose main part is essentially a bunch of <span> tags. In these, the id for the span is random.
The basic page structure is as follows:
<pre>
<span id = "abcdf">
x
<span title="1">
random text
random text
</span>
... Repeat ...
<span id = "awfaf">
x
<span title="127">
random text
random text
</span>
</pre>
The id is always random, and the title for the span is always an integer. (It increases, so 1-128 on page one, 129-256 on page two, etc.)
What I would like to do is pull the id of each span, and then the two columns of text in its second and third links, on each page.
I'm not sure how to go about doing this in a repeatable way and simply need an idea for the logic, that is, which elements to pull and such when going through the pages.
Following is one of the ways to get the required data, using Java:
List<String> idList = new ArrayList<String>();
List<String> textList1 = new ArrayList<String>();
List<String> textList2 = new ArrayList<String>();

int i = 1;
// Loop over the top-level spans until no more are found
while (driver.findElements(By.xpath("//pre/span[" + i + "]")).size() != 0) {
    idList.add(driver.findElement(By.xpath("//pre/span[" + i + "]")).getAttribute("id"));
    textList1.add(driver.findElement(By.xpath("//pre/span[" + i + "]/a[2]")).getText());
    textList2.add(driver.findElement(By.xpath("//pre/span[" + i + "]/a[3]")).getText());
    i++;
}
The above code can be executed for each page.
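Since the rest of the thread is in Python, a rough Python equivalent of the same loop; the /a[2] and /a[3] anchors are the same assumption the Java answer makes, as they are not visible in the snippet above:
id_list, text_list_1, text_list_2 = [], [], []
i = 1
# Loop over the top-level spans until no more are found
while driver.find_elements_by_xpath("//pre/span[%d]" % i):
    base = "//pre/span[%d]" % i
    id_list.append(driver.find_element_by_xpath(base).get_attribute("id"))
    text_list_1.append(driver.find_element_by_xpath(base + "/a[2]").text)
    text_list_2.append(driver.find_element_by_xpath(base + "/a[3]").text)
    i += 1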

get last page number - web scraping

I am trying to scrape a site with multiple pages. I would like to build a function that returns the number of pages within a set of pages.
Here is an example starting page.
There are 29 sub pages within that leading page, ideally the function would therefore return 29.
By subpage I mean, page 1 of 29, 2 of 29 etc etc.
This is the HTML snippet which contains the last page information, from the link posted above.
<div id="paging-wrapper-btm" class="paging-wrapper">
<ol class="page-nos"><li ><span class="selected">1</span></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>Weiter »</li></ol>
I have the following code which will find all ol tags, but I can't figure out how to access the contents contained within each 'a'.
a = soup.find_all('ol')
b = [x['a'] for x in a]  # <-- this part returns an error
< further processing >
Any help/suggestions much appreciated.
Ah.. I found a simple solution:
for item in soup.select("ol a"):
    x = item.text
    print(x)
I can then sort and select the largest number.
Try this:
ols = soup.find_all('ol')
list_of_as = [ol.find_all('a') for ol in ols]  # finds all a's inside each ol
all_as = []
for a in list_of_as:  # expand each sublist of a's and put all of them in one list
    all_as.extend(a)
print(all_as)
The following would extract the last page number:
from bs4 import BeautifulSoup
import requests

html = requests.get("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&via=top&r=2#parentID=-1&pge=1&pgeSize=36&sort=-1")
soup = BeautifulSoup(html.text, "html.parser")
ol = soup.find('ol', class_='page-nos')
pages = [li.text for li in ol.find_all('li')]
last_page = pages[-2]  # the last li is the "Weiter" (next) link, so take the one before it
print(last_page)
Which for your website will display:
30
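Wrapped up as the function the question asks for, a sketch that takes the max of the numeric entries instead of relying on their position (this assumes the full pagination list from the live page, not just the two-item snippet quoted above):
def last_page_number(soup):
    ol = soup.find('ol', class_='page-nos')
    nums = [int(li.text) for li in ol.find_all('li') if li.text.strip().isdigit()]
    return max(nums)

print(last_page_number(soup))  # 30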
