I am trying to scrape a website using Scrapy, and the structure I am trying to get data from is as follows:
<div class="AA BB">
<span data-format-supply data-format-value="16">16</span>HAYA
</div>
<div class="AA BB">
<span data-format-supply data-format-value="21">42</span>
<span data-format-supply data-format-value="21">21</span>HAYA
</div>
I want to extract the text within the second span of the second div. For example, in this case, I want to extract 21. My code is as below:
def parse(self, response):
    sel = Selector(response)
    sel.css('div.coin-summary-item-detail').extract()
My question is: how do I select the second AA-class div using CSS? And after that, how do I specify that I want only the text inside the second span?
Any help will be greatly appreciated, thank you !!!
.extract() returns a list of the divs matching that path, so you can iterate and print the one you need. There are shortcut methods to extract the first element, but not the second one.
Please try:
def parse(self, response):
    sel = Selector(response)
    results = sel.css('div.AA').extract()
    for index, result in enumerate(results):
        if index == 1:  # the second matching div
            print(result)
Try sel.css('div.AA + div.AA span:last-child::text').extract()
The + combinator tells the spider to select the sibling element that immediately follows the div whose class attribute contains 'AA', i.e. the second div, which also has 'AA' in its class attribute.
span:last-child tells the spider to choose the last span element among the span children (inside the second div, of course). You may also write it as span:nth-child(2), which selects the second span element.
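Putting the pieces together, a minimal sketch of the callback, assuming the simplified AA class names from the snippet above (substitute the real class names from your page):
from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    # + jumps from the first div.AA to its sibling, the second div.AA;
    # span:last-child picks the last <span> inside it, and ::text extracts "21"
    value = sel.css('div.AA + div.AA span:last-child::text').extract_first()
    print(value)  # -> 21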
I get this output list of HTML:
HyperSense Software
QSS Technosoft - A CMMI Level 3 Certified Company
and more in the same format. I need to extract the href link from them.
My code
from urllib.request import urlopen
from bs4 import BeautifulSoup

mainurl = "https://www.appfutura.com/app-developers"
html = urlopen(mainurl).read()
main_soup = BeautifulSoup(html, "lxml")
allurl = main_soup.find_all('h3')
for i in allurl:
    for a in i:
        print(a)
How can I extract href in this loop?
You're close. One small change in your for loop:
for i in allurl:
    print(i.a["href"])
This gets the child with tag "a" and then the "href" attribute for that tag.
If you aren't sure how many "a" tags there are in each "h3" block, or there is more than one, you can use another for loop (or, depending on what you're doing, a list comprehension):
for i in allurl:
    aa = i.find_all('a')
    for j in aa:
        print(j["href"])
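The same extraction as a one-line list comprehension, equivalent to the nested loop above:
hrefs = [a["href"] for h3 in allurl for a in h3.find_all('a')]
print(hrefs)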
I found a way using a CSS selector:
urllist = []
mainurl = "https://www.appfutura.com/app-developers"
html = urlopen(mainurl).read()
main_soup = BeautifulSoup(html, "lxml")
elms = main_soup.select("h3 a")
for i in elms:
    urllist.append(i.attrs["href"])
print(urllist)
Thanks !!
How do I get all the text after the third p tag from this code with BeautifulSoup web scraping?
questions = soup.find('div',{'class':'entry-content'})
exp = questions.p[3].text
(There is a way to do something like this, but I can't get it to work.)
Can anyone here help? I would be very thankful.
Try below code, if that helps:
# This will fetch the first div with class entry-content.
# If that is not the div you want, use find_all instead and select the
# appropriate div by indexing.
questions = soup.find('div', class_='entry-content')

# This will get all the p tags present in questions.
p_tags = questions.find_all('p')

# Collect the text of every <p> after the third one.
lst = []
for tag in p_tags[3:]:
    lst.append(tag.text)

# This will get you the text of the 4th <p> tag alone.
exp = p_tags[3].text
This: questions = soup.find('div', {'class': 'entry-content'})
only finds the first matching element. You need:
questions = soup.find_all('div', {'class': 'entry-content'})
find_all returns a list of all matches, which you can then index into, e.g. with [3].
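A minimal sketch of the difference, assuming soup has already been built from the page:
first_div = soup.find('div', {'class': 'entry-content'})     # a single Tag (or None)
all_divs = soup.find_all('div', {'class': 'entry-content'})  # a list of Tags
# find_all returns an indexable list, so the text after the third <p> is:
texts = [p.text for p in first_div.find_all('p')[3:]]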
I'm working in Python on a Selenium problem. I'm trying to gather each element with an h1 tag, and following that tag, I want to get the closest h2 and paragraph text tags and place that data into an object.
My current code looks like:
cards = browser.find_elements_by_tag_name("h1")
ratings = browser.find_elements_by_tag_name('h3')
descriptions = browser.find_elements_by_tag_name('p')
print(len(cards))
print(len(ratings))
print(len(descriptions))
which is generating inconsistent numbers.
To get the <h1> tag elements and then the next sibling <h2> and <p> tag elements you can use the following solution:
cards = browser.find_elements_by_tag_name("h1")
ratings = browser.find_elements_by_xpath("//h1/following-sibling::h2")
descriptions = browser.find_elements_by_xpath("//h1/following-sibling::p")
print(len(cards))
print(len(ratings))
print(len(descriptions))
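If you need each h1 paired with its own closest h2 and p, rather than three flat lists (which is what makes the counts inconsistent), you can anchor the XPath to each card. A sketch, assuming each h1 really is followed by sibling h2 and p elements:
cards = []
for h1 in browser.find_elements_by_tag_name("h1"):
    # the first <h2> and <p> siblings that follow this particular <h1>
    rating = h1.find_element_by_xpath("following-sibling::h2[1]").text
    description = h1.find_element_by_xpath("following-sibling::p[1]").text
    cards.append({"title": h1.text, "rating": rating, "description": description})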
I've written an XPath expression to get the highest page number from some HTML elements. However, with the below XPath I'm getting the last text, which in this case is Next Page. I want the XPath to act in such a way that I can get the highest number, 6, instead.
The elements upon which the xpath should be applied:
content = """
<div class="nav-links"><span aria-current="page" class="page-numbers current"><span class="meta-nav screen-reader-text">Page </span>1</span>
<a class="page-numbers" href="https://page/2/"><span class="meta-nav screen-reader-text">Page </span>2</a>
<span class="page-numbers dots">…</span>
<a class="page-numbers" href="https://page/6/"><span class="meta-nav screen-reader-text">Page </span>6</a>
<a class="next page-numbers" href="https://page/2/"><span class="screen-reader-text">Next Page</span></a></div>
"""
What I've tried so far:
from lxml.html import fromstring
root = fromstring(content)
pagenum = root.xpath("//*[contains(@class,'page-numbers')][last()]/span")[0].text
print(pagenum)
Output I'm getting:
Next Page
Output I wish to have:
6
You can use the exact class name to avoid fetching the Next link:
//a[@class="page-numbers"][last()]
Note that contains(@class, 'page-numbers') will return the links with numbers plus the Next link, while @class="page-numbers" returns the numbered links only.
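Applied to the content in the question, a short sketch (the /text() step grabs the bare number that follows the inner span):
from lxml.html import fromstring

root = fromstring(content)
# the exact class match excludes the "next page-numbers" link,
# so last() lands on the highest numbered link
pagenum = root.xpath('//a[@class="page-numbers"][last()]/text()')[0]
print(pagenum)  # -> 6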
I am trying to scrape a site with multiple pages. I would like to build a function that returns the number of pages within a set of pages.
Here is an example starting page.
There are 29 sub pages within that leading page, ideally the function would therefore return 29.
By subpage I mean page 1 of 29, 2 of 29, etc.
This is the HTML snippet which contains the last page information, from the link posted above.
<div id="paging-wrapper-btm" class="paging-wrapper">
<ol class="page-nos"><li ><span class="selected">1</span></li><li ><a href='http://www.asos.de/Herren-Jeans/podlh/?cid=4208&pge=1&pgesize=36&sort=-1'>Weiter »</li></ol>
I have the following code, which will find all ol tags, but I can't figure out how to access the contents contained within each 'a'.
a = soup.find_all('ol')
b = [x['a'] for x in a]  # <-- this part returns an error
< further processing >
Any help/suggestions much appreciated.
Ah.. I found a simple solution.
for item in soup.select("ol a"):
    x = item.text
    print(x)
I can then sort and select the largest number.
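A sketch of that sort-and-select step, assuming the live page lists numbered links (the snippet above is truncated) and skipping non-numeric entries such as the Weiter » link:
page_numbers = [int(item.text) for item in soup.select("ol a") if item.text.strip().isdigit()]
print(max(page_numbers))  # the highest page number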
Try this:
ols = soup.find_all('ol')
list_of_as = [ol.find_all('a') for ol in ols]  # finds all a's inside each ol

all_as = []
for a in list_of_as:  # flatten each sublist of a's into one list
    all_as.extend(a)

print(all_as)
The following would extract the last page number:
from bs4 import BeautifulSoup
import requests

html = requests.get("http://www.asos.de/Herren-Jeans/podlh/?cid=4208&via=top&r=2#parentID=-1&pge=1&pgeSize=36&sort=-1")
soup = BeautifulSoup(html.text, "lxml")

ol = soup.find('ol', class_='page-nos')
pages = [li.text for li in ol.find_all('li')]
last_page = pages[-2]  # the last <li> is the "Weiter" link, so take the one before it
print(last_page)
Which for your website will display:
30