I have HTML like the following:
<li class="expandSubItem">
<span class="expandSubLink">Popular Neighborhoods</span>
<ul class="secondSubNav" style="top:-0.125em;">
<li class="subItem">
<a class="subLink" href="/Hotels-g187147-zfn7236765-Paris_Ile_de_France-Hotels.html">Quartier Latin Hotels</a>
</li>
</ul>
</li>
<li class="expandSubItem">
<span class="expandSubLink">Popular Paris Categories</span>
<ul class="secondSubNav" style="top:-0.125em;">
<li class="subItem">
<a class="subLink" href="/HotelsList-Paris-Cheap-Hotels-zfp10420.html">Paris Cheap Hotels</a>
</li>
</ul>
</li>
I want to get all links under "Popular Paris Categories". I tried something like //li//a/@href/following::span[text()='Popular Paris Categories'], but it gave no results. Any idea how to get the correct result? Here is a snippet of the Python code I wrote:
import requests
from lxml import html

t_url = 'https://www.tripadvisor.com/Tourism-g187147-Paris_Ile_de_France-Vacations.html'
page = requests.get(t_url, timeout=30)
tree = html.fromstring(page.content)
links = tree.xpath('//li[span="Popular Paris Categories"]//a/@href')
print(links)
This is one possible way:
//li[normalize-space(span)="Popular Paris Categories"]//a/@href
Notice how normalize-space() is used to remove the trailing space from the span content. That stray space is why the XPath I initially suggested in the comments didn't work on your actual HTML.
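To see the difference in isolation, here is a minimal sketch (assuming a trailing space in the span text, as described above):

from lxml import html

snippet = '''
<ul>
  <li class="expandSubItem">
    <span class="expandSubLink">Popular Paris Categories </span>
    <ul class="secondSubNav">
      <li class="subItem">
        <a class="subLink" href="/HotelsList-Paris-Cheap-Hotels-zfp10420.html">Paris Cheap Hotels</a>
      </li>
    </ul>
  </li>
</ul>
'''
tree = html.fromstring(snippet)
# The exact-match predicate fails because of the trailing space...
print(tree.xpath('//li[span="Popular Paris Categories"]//a/@href'))  # []
# ...while normalize-space() trims it before comparing.
print(tree.xpath('//li[normalize-space(span)="Popular Paris Categories"]//a/@href'))
# ['/HotelsList-Paris-Cheap-Hotels-zfp10420.html']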
Something like this, perhaps:
//span[text()='Popular Paris Categories']/following-sibling::ul//a/@href
I'm trying to parse HTML using BeautifulSoup (called with lxml).
On nested tags I'm getting repeated text.
I've tried going through and only counting tags that have no children, but then I'm losing out on data.
Given:
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments</span>
</li>
</ul>
</div>
and running:
from bs4 import BeautifulSoup

soup = BeautifulSoup(file_info, features="lxml")
soup.prettify().encode("utf-8")
for tag in soup.find_all(True):
    if check_text(tag.text):  # returns False on empty strings / all-number strings
        print(tag.text)
I get "to post comments" 4 times.
Is there a beautifulsoup way of just getting the result once?
Given an input like
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments1</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments2</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments3</span>
</li>
</ul>
</div>
You could do something like:
[x.span.string for x in soup.find_all("li", class_="comment_forbidden first last")]
which would give
[' to post comments1', ' to post comments2', ' to post comments3']
find_all() is used to find all the <li> tags with class comment_forbidden first last, and the content of each <li>'s <span> child is obtained via its string attribute.
For anyone struggling with this, try swapping out the parser. I switched to html5lib and I no longer have repetitions. It is a costlier parser, though, so it may cause performance issues.
soup = BeautifulSoup(html, "html5lib")
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
You can use find() instead of find_all() to get your desired result only once
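A minimal sketch of that approach, using the markup from the question:

from bs4 import BeautifulSoup

html_doc = '''
<div class="links">
  <ul class="links inline">
    <li class="comment_forbidden first last">
      <span> to post comments</span>
    </li>
  </ul>
</div>
'''
soup = BeautifulSoup(html_doc, "lxml")
# find() stops at the first match, so the text is printed only once
li = soup.find("li", class_="comment_forbidden")
print(li.span.string)  # " to post comments"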
I have Python code that retrieves some data from a web page (web scraping).
At some point the code returns the following list:
<ul class="nav nav--stacked" id="designer-list">
<li>
<h2>
<a class="text-uppercase bold router-link-active" href="/en-ca/cars_all">
All Cars
</a>
</h2>
</li>
<li>
<a href="/en-ca/cars/c1">
<span>
The car c1
</span>
</a>
</li>
<li>
<a href="/en-ca/cars/c2">
<span>
The car c2
</span>
</a>
</li>
</ul>
I am using BeautifulSoup, and I just want to retrieve the reference (href) and name for each car.
In this example I want to retrieve (/en-ca/cars/c1) => (The car c1) AND (/en-ca/cars/c2) => (The car c2), and skip the first element (All Cars).
I could use .find_all('li') and skip the first element inside the loop.
I was wondering whether there is a way to reject that element through BeautifulSoup methods.
You can do it like this, though it's not through BeautifulSoup methods:
soup = BeautifulSoup(html, "html.parser")
content = soup.find_all('li')[1:]
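If you would rather stay within BeautifulSoup's own query methods, a CSS selector with the child combinator does the filtering for you: the "All Cars" entry is skipped because its <a> sits under an <h2> rather than directly under the <li>. A sketch using the list from the question:

from bs4 import BeautifulSoup

html = '''
<ul class="nav nav--stacked" id="designer-list">
  <li><h2><a href="/en-ca/cars_all">All Cars</a></h2></li>
  <li><a href="/en-ca/cars/c1"><span>The car c1</span></a></li>
  <li><a href="/en-ca/cars/c2"><span>The car c2</span></a></li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")
# "li > a" only matches <a> tags that are direct children of an <li>
for a in soup.select("#designer-list > li > a"):
    print(a["href"], "=>", a.get_text(strip=True))
# /en-ca/cars/c1 => The car c1
# /en-ca/cars/c2 => The car c2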
My intention is to scrape the names of the top-selling products on Ali-Express.
I'm using the Requests library alongside Beautiful Soup to accomplish this.
from bs4 import BeautifulSoup as bs
import requests as req
import pprint as pp

url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
# pp.pprint(soup)  # verify that the page has been fetched
all_items = soup.find_all('li', class_='top10-item')
pp.pprint(all_items)
# []
However, this returns an empty list, indicating that soup.find_all() did not find any tags fitting those criteria.
Inspect Element in Chrome displays the fully rendered list items (screenshot omitted).
However, in the page source the <ul class="top10-items"> element is empty and is followed by a script, which seems to be a template iterated over for each list item (I'm not familiar with HTML).
<div class="container">
<div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<div class="rank-orders">
<span class="rank">{{rank}}</span>
<span class="orders">{{productOrderNum}}</span>
</div>
<div class="img-wrap">
<a href="{{productDetailUrl}}" target="_blank">
<img src="{{productImgUrl}}" alt="{{productName}}">
</a>
</div>
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
<p class="item-price">
<span class="price">US ${{productMinPrice}}</span>
<span class="uint">/ {{productUnitType}}</span>
</p>
</li>
{{/topList}}</script>
</div>
</div>
So this probably explains why soup.find_all() doesn't find the "li" tag.
My question is: How can I extract the item names from the script using Beautiful soup?
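For what it's worth, the {{productName}} entries in that script are mustache placeholders, not data: the product names are injected client-side after the page loads, so they never appear in the HTML that requests receives. A minimal sketch that pulls the template out (assuming the X-template-top10 class from the snippet above) makes this visible:

from bs4 import BeautifulSoup
import requests

url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# The template lives in a <script class="X-template-top10"> tag
template = soup.find("script", class_="X-template-top10")
print(template.string)  # only {{...}} placeholders, no real product names

To obtain the actual names you would need either the JSON endpoint the page calls to fill the template, or a browser-driven tool such as Selenium that executes the JavaScript first.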
I'm trying to reach the address information on a site. Here's an example of my code:
companytype_list = sel.xpath('''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath('''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath('''.//li[@class="company-size"]/p/text()''').extract()
And here's an example of how addresses are formatted on the site:
<li class="type">
<h4>Type</h4>
<p>
Privately Held
</p>
</li>
<li class="vcard hq">
<h4>Headquarters</h4>
<p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
<span class="street-address" itemprop="streetAddress"></span>
<span class="locality" itemprop="addressLocality">Stockholm,</span>
<abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
<span class="postal-code" itemprop="postalCode">S-11127</span>
<span class="country-name" itemprop="addressCountry">Sweden</span>
</p>
</li>
<li class="company-size">
<h4>Company Size</h4>
<p>
11-50 employees
</p>
But when I run the Scrapy script I get an IndexError: list index out of range for the address (vcard hq). I've tried rewriting the code to get the data, but it doesn't work. The rest of the spider works fine. Am I missing something?
Your example works fine, but I guess your XPath expressions failed on another page or another part of the HTML.
The problem is the use of an index (span[3]) in the headquarters_list XPath expression. With indexes you depend heavily on:
1. the total number of span elements, and
2. the exact order of the span elements.
In general, indexes tend to make XPath expressions more fragile and more likely to fail, so I would avoid them whenever possible. In your example you actually want the locality part of the address info. The span element can just as easily be referenced by its class name, which makes the expression much more robust:
//li[@class="vcard hq"]/p/span[@class='locality']/text()
Here is my test code based on your problem description:
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
html_text = """
<li class="type">
<h4>Type</h4>
<p>
Privately Held
</p>
</li>
<li class="vcard hq">
<h4>Headquarters</h4>
<p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
<span class="street-address" itemprop="streetAddress"></span>
<span class="locality" itemprop="addressLocality">Stockholm,</span>
<abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
<span class="postal-code" itemprop="postalCode">S-11127</span>
<span class="country-name" itemprop="addressCountry">Sweden</span>
</p>
</li>
<li class="company-size">
<h4>Company Size</h4>
<p>
11-50 employees
</p>
"""
sel = Selector(text=html_text)
companytype_list = sel.xpath(
    '''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath(
    '''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath(
    '''.//li[@class="company-size"]/p/text()''').extract()
It doesn't raise any exception. So chances are there exist web pages with a different structure causing errors.
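Swapping in the class-based expression for the headquarters field makes the result independent of how many street-address spans precede the locality; a minimal sketch against the same html_text:

headquarters_list = sel.xpath(
    './/li[@class="vcard hq"]/p/span[@class="locality"]/text()').extract()
print(headquarters_list)  # ['Stockholm,'] even if spans are added, removed, or reordered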
It's good practice not to use indexes directly in XPath rules. dron22's answer gives an awesome explanation.
This is probably an easy question, but I can't figure it out.
I'm having trouble extracting an email address and a URL from this part of a webpage with BeautifulSoup:
<!-- ENDE telefonnummer.jsp --></li>
<li class="email ">
<a
class="link"
href="mailto:info#taxi-ac.de"
data-role="email-layer"
data-template-replacements='{
"name": "Aachener-Airport-Taxi Blum",
"subscriberId": "128027562762",
"captchaBase64": "data:image/jpg;base64,/9j/4AAQSkZJRgABAgAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAAvAG4DASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD02iivPLm58L6d4x1i4nsdLGpRNCsUcsSIFcASG5eUjEQLXCqWPJMfy72IWvyDD4d13JK90r6K/VLurb7nuzlynodcfqvxJ0PTda/se3ivtU1AMyPBp0HmsjKMkHJGTjOducbTnGK1rDw7awanJrF7FaXOsy4DXcdsI9oClQEBLEcE5JYk5xnaFVfGXTxP8JPElxqVxbQ6jaXshja8lGTOMhz8+S0bnng5BIJw+0GvRy3A4fETnHm5pJe6r8vM+uuui+TflqY1akoJO1l9567oPjKz13VJtMOnapp17HCLgQ6hbeUzx7tpZeTwDgc468ZwcM/4WD4ZOp/2bHfTTXh+7DDZzyFxt3ArtQ7lK/MCMgjkcVZ8LeLdK8X6e93pkjgxttlglAWSM9twBPBAyCCR17ggeWaHfw3fx11nU9QS53WTTrEtnbSTZKYgG5UVmxsySeBux06VVDAU6s6yqQlHkjeyet+2q2YSqOKjZp3Z61pnibSdX1C50+0uX+22yh5baeCSGRVPQ7XUEjkdPUeorz/xt8SPEPgzxWthJbaXdWUircR7UkSQxFiNpO4gN8pGcEdDjsKvhy1k8U/GOfxdpjQtpEWcu0yeYf3JhH7sEuu4gsNwXKjPXiuw1zw/Z+J9Y13S7xEIk0y0MUjLuMMm+62uORyCfUZGQeCa1jRwmDxKVVc0eVOSe8W2k1p1W/Tt5icp1Ie7o76eZu2epf294ft9R0a4hj+1RrJE80fmhPVWVXHzDlSA3BHfGKg8N3ep31pcT6jPaSbbmaCNbe3aLHlSvGSdztnO0HHGOnNeH6Dr2ufCfxPNpWqwvJYOwaaBTlXU8CaInHOB7ZxtOCPl9t8IzRXGhPPBIksUl/eukiMGVlN1KQQR1BFY5jl7wcG4tShJrllptrpf7v6uOlV9o9dGtzdooorxToCiiigCG7uo7K2e4lWZkTGRDC8rcnHCoCx69hWF4ZeHUNN1GG5s7ndNd3DTre2ckfnRvK4jz5ijePKCLjnChQccCujorWNRRpuNtW1rft/XfsJq7uclot5qem+EDY2+mTX2p6V/oiQtG1otwiSGNHV5AVOY1DnBI5xxkVn6j45vL7T57Ow8Ea/NdXC+THHqFhstyW4/eHcflwec4B6EjqO9orojiqXO5zpptu+7Xy06fc/MhwdrJnmXgjw7efDnwvqV9f2013qd5tKWdmrzfdQlEOxDtYsWBblR8vPrmfCS3k8LWWpy6vZavb3F1JGqwf2TcPhUBw25UI5LkY7bfevYKK6Z5rKrGqqsbuo1dp2+HZLRkqik1boeWfDvw7q6eOtc8UXdhNY2N95/kR3Q2THfMGGU5K4C85x1GMjmupstYgk8X3c4tdUWK5tLWCOR9MuUUusk5YEmMbQBIvJwOevBrqqKxxGO+sTlOpHdJKz2S+++w40+RJJnK+PfBsXjPQRarIkN7A3mW0zKCA2MFWOMhW4zjuFPOME+HGnXmk+A9PsL+3eC6gaZZI36g+c/5gjkEcEEEV1VFYvGVXhvqr+FO68t/wDMr2a5+fqFFFFcpYUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFAH//Z",
"captchaWidth": "110",
"captchaHeight": "47",
"captchaEncryptedAnswer": "767338fffffff8ffffffd6ffffff8d3038ffffffba1971ffffffdfffffffe3f6c9"
}'
data-wipe='{"listener":"click","name":"Detailseite E-Mail","id":"128027562762"}'
>
<i class="icon-mail"></i>
<span class="text" >info#taxi-ac.de</span>
</a>
</li>
<li class="website ">
<a class="link" href="http://www.aachener-airport-taxi.de" rel="follow" target="_blank" title="http://www.aachener-airport-taxi.de"
data-wipe='{"listener":"click","name":"Detailseite Webadresse","id":"128027562762"}'>
<i class="icon-website"></i>
<span class="text">Zur Website</span>
</a>
</li>
</ul>
</div>
I'm trying to get info@taxi-ac.de and http://www.aachener-airport-taxi.de out of there.
soup.find(class='email') obviously doesn't work, because class is a reserved keyword in Python, so the interpreter thinks I'm declaring a class inside the brackets. And while I can use
for link in soup.find_all('a'):
    print(link.get('href'))
to get ALL the links in there, I want this specific one. The links are always different, so I can't regex for them, so I guess one has to navigate through the HTML body by hand.
print(soup.find("span",{"class":"text"}).text)
print(soup.find(attrs={"class":"website"}).a["href"])
info#taxi-ac.de
http://www.aachener-airport-taxi.de
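The email is also available from the mailto: link itself; a small sketch reusing the same soup (note that class_="email" still matches even though the attribute is class="email " with a trailing space, since BeautifulSoup compares individual class tokens):

# Pull the address out of the mailto: href instead of the span text
email_href = soup.find("li", class_="email").a["href"]
print(email_href.replace("mailto:", ""))  # info@taxi-ac.de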