Extracting data with BeautifulSoup

This is probably an easy question, but I can't figure it out.
I'm having trouble extracting email and url from this part of a webpage with BeautifulSoup:
<!-- ENDE telefonnummer.jsp --></li>
<li class="email ">
<a
class="link"
href="mailto:info#taxi-ac.de"
data-role="email-layer"
data-template-replacements='{
"name": "Aachener-Airport-Taxi Blum",
"subscriberId": "128027562762",
"captchaBase64": "data:image/jpg;base64,/9j/4AAQSkZJRgABAgAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAAvAG4DASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD02iivPLm58L6d4x1i4nsdLGpRNCsUcsSIFcASG5eUjEQLXCqWPJMfy72IWvyDD4d13JK90r6K/VLurb7nuzlynodcfqvxJ0PTda/se3ivtU1AMyPBp0HmsjKMkHJGTjOducbTnGK1rDw7awanJrF7FaXOsy4DXcdsI9oClQEBLEcE5JYk5xnaFVfGXTxP8JPElxqVxbQ6jaXshja8lGTOMhz8+S0bnng5BIJw+0GvRy3A4fETnHm5pJe6r8vM+uuui+TflqY1akoJO1l9567oPjKz13VJtMOnapp17HCLgQ6hbeUzx7tpZeTwDgc468ZwcM/4WD4ZOp/2bHfTTXh+7DDZzyFxt3ArtQ7lK/MCMgjkcVZ8LeLdK8X6e93pkjgxttlglAWSM9twBPBAyCCR17ggeWaHfw3fx11nU9QS53WTTrEtnbSTZKYgG5UVmxsySeBux06VVDAU6s6yqQlHkjeyet+2q2YSqOKjZp3Z61pnibSdX1C50+0uX+22yh5baeCSGRVPQ7XUEjkdPUeorz/xt8SPEPgzxWthJbaXdWUircR7UkSQxFiNpO4gN8pGcEdDjsKvhy1k8U/GOfxdpjQtpEWcu0yeYf3JhH7sEuu4gsNwXKjPXiuw1zw/Z+J9Y13S7xEIk0y0MUjLuMMm+62uORyCfUZGQeCa1jRwmDxKVVc0eVOSe8W2k1p1W/Tt5icp1Ie7o76eZu2epf294ft9R0a4hj+1RrJE80fmhPVWVXHzDlSA3BHfGKg8N3ep31pcT6jPaSbbmaCNbe3aLHlSvGSdztnO0HHGOnNeH6Dr2ufCfxPNpWqwvJYOwaaBTlXU8CaInHOB7ZxtOCPl9t8IzRXGhPPBIksUl/eukiMGVlN1KQQR1BFY5jl7wcG4tShJrllptrpf7v6uOlV9o9dGtzdooorxToCiiigCG7uo7K2e4lWZkTGRDC8rcnHCoCx69hWF4ZeHUNN1GG5s7ndNd3DTre2ckfnRvK4jz5ijePKCLjnChQccCujorWNRRpuNtW1rft/XfsJq7uclot5qem+EDY2+mTX2p6V/oiQtG1otwiSGNHV5AVOY1DnBI5xxkVn6j45vL7T57Ow8Ea/NdXC+THHqFhstyW4/
eHcflwec4B6EjqO9orojiqXO5zpptu+7Xy06fc/MhwdrJnmXgjw7efDnwvqV9f2013qd5tKWdmrzfdQlEOxDtYsWBblR8vPrmfCS3k8LWWpy6vZavb3F1JGqwf2TcPhUBw25UI5LkY7bfevYKK6Z5rKrGqqsbuo1dp2+HZLRkqik1boeWfDvw7q6eOtc8UXdhNY2N95/kR3Q2THfMGGU5K4C85x1GMjmupstYgk8X3c4tdUWK5tLWCOR9MuUUusk5YEmMbQBIvJwOevBrqqKxxGO+sTlOpHdJKz2S+++w40+RJJnK+PfBsXjPQRarIkN7A3mW0zKCA2MFWOMhW4zjuFPOME+HGnXmk+A9PsL+3eC6gaZZI36g+c/5gjkEcEEEV1VFYvGVXhvqr+FO68t/wDMr2a5+fqFFFFcpYUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFAH//Z",
"captchaWidth": "110",
"captchaHeight": "47",
"captchaEncryptedAnswer": "767338fffffff8ffffffd6ffffff8d3038ffffffba1971ffffffdfffffffe3f6c9"
}'
data-wipe='{"listener":"click","name":"Detailseite E-Mail","id":"128027562762"}'
>
<i class="icon-mail"></i>
<span class="text" >info#taxi-ac.de</span>
</a>
</li>
<li class="website ">
<a class="link" href="http://www.aachener-airport-taxi.de" rel="follow" target="_blank" title="http://www.aachener-airport-taxi.de"
data-wipe='{"listener":"click","name":"Detailseite Webadresse","id":"128027562762"}'>
<i class="icon-website"></i>
<span class="text">Zur Website</span>
</a>
</li>
</ul>
</div>
I'm trying to get info#taxi-ac.de and http://www.aachener-airport-taxi.de out of there.
soup.find(class='email') obviously doesn't work, because class is a reserved keyword in Python and the interpreter thinks I want to declare a class inside the parentheses. While I can use
for link in soup.find_all('a'):
    print(link.get('href'))
to get ALL the links in there, I want this specific one. The links are always different, so I can't regex for them, so I guess one would have to navigate through the HTML body by hand.

print(soup.find("span",{"class":"text"}).text)
print(soup.find(attrs={"class":"website"}).a["href"])
info#taxi-ac.de
http://www.aachener-airport-taxi.de
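
A slightly more targeted variant of the same idea, as a sketch only. It assumes a trimmed-down copy of the fragment from the question held in a string (the name html is made up for this example) and uses the built-in html.parser. Beautiful Soup accepts class_ (with a trailing underscore) as a keyword argument, which sidesteps the reserved word, and select_one takes a CSS selector:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the fragment in the question (assumption:
# the real page has the same li.email / li.website structure).
html = """
<ul>
<li class="email ">
<a class="link" href="mailto:info#taxi-ac.de" data-role="email-layer">
<i class="icon-mail"></i>
<span class="text" >info#taxi-ac.de</span>
</a>
</li>
<li class="website ">
<a class="link" href="http://www.aachener-airport-taxi.de" rel="follow">
<i class="icon-website"></i>
<span class="text">Zur Website</span>
</a>
</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with underscore) avoids the clash with Python's class keyword.
email_li = soup.find("li", class_="email")
email = email_li.find("span", class_="text").get_text(strip=True)

# The same kind of lookup written as a CSS selector.
website = soup.select_one("li.website a.link")["href"]

print(email)
print(website)
```

Scoping the search to the li element first means the result does not depend on the email span happening to be the first span with class text on the page.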

Related

Getting repeats in beautifulsoup nested tags

I'm trying to parse through html using beautifulsoup (being called with lxml).
On nested tags I'm getting repeated text
I've tried going through and only counting tags that have no children, but then I'm losing out on data
given:
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments</span>
</li>
</ul>
</div>
and running:
soup = BeautifulSoup(file_info, features="lxml")
soup.prettify().encode("utf-8")
for tag in soup.find_all(True):
    if check_text(tag.text):  # False on empty string / all numbers
        print(tag.text)
I get "to post comments" 4 times.
Is there a beautifulsoup way of just getting the result once?

Given an input like
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments1</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments2</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments3</span>
</li>
</ul>
</div>
You could do something like
[x.span.string for x in soup.find_all("li", class_="comment_forbidden first last")]
which would give
[' to post comments1', ' to post comments2', ' to post comments3']
find_all() finds every <li> tag whose class is comment_forbidden first last, and the string attribute of each <li> tag's <span> child gives its text.

For anyone struggling with this, try swapping out the parser. I switched to html5lib and I no longer have repetitions. It is a slower parser, though, so it may cause performance issues.
soup = BeautifulSoup(html, "html5lib")
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

You can use find() instead of find_all() to get your desired result only once.
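
The repeats come from the fact that a tag's text includes the text of everything nested inside it, so each enclosing tag reports the same string again. A minimal sketch of one way around that, assuming the snippet from the question and the built-in html.parser (not the lxml parser the question uses), is to iterate over the text nodes rather than over the tags:

```python
from bs4 import BeautifulSoup

html = """
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments</span>
</li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every enclosing tag's text includes the span's text, hence the repeats.
repeats = [tag.get_text(strip=True) for tag in soup.find_all(True)]

# stripped_strings yields each non-blank text node exactly once.
once = list(soup.stripped_strings)

print(len(repeats))
print(once)
```

Here the tag-based loop sees the text once for each of div, ul, li and span, while stripped_strings reports it a single time.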

python selenium get value of specific element

<li id="button1" class="on">
<div class="supply1">
<div class="buildingimg">
<a class="fastBuild tooltip js_hideTipOnMobile" title="Metallmine auf Stufe 4 ausbauen" href="javascript:void(0);" onclick="sendBuildRequest('https://s159-de.ogame.gameforge.com/game/index.php?page=resources&modus=1&type=1&menge=1&token=0c86d8a8bf9a5c559538b0e13cb462b4', null, 1);">
<img src="https://gf2.geo.gfsrv.net/cdndf/3e567d6f16d040326c7a0ea29a4f41.gif" width="22" height="14">
</a>
<a class="detail_button tooltip js_hideTipOnMobile slideIn" title="" ref="1" id="details" href="javascript:void(0);">
<span class="ecke">
<span class="level">
<span class="textlabel">
**Metallmine**
</span>
**3** </span>
</span>
</a>
</div>
</div>
</li>
<li id="button2" class="on">
<div class="supply2">
<div class="buildingimg">
<a class="fastBuild tooltip js_hideTipOnMobile" title="" href="javascript:void(0);" onclick="sendBuildRequest('https://s159-de.ogame.gameforge.com/game/index.php?page=resources&modus=1&type=2&menge=1&token=0c86d8a8bf9a5c559538b0e13cb462b4', null, 1);">
<img src="https://gf2.geo.gfsrv.net/cdndf/3e567d6f16d040326c7a0ea29a4f41.gif" width="22" height="14">
</a>
<a class="detail_button tooltip js_hideTipOnMobile slideIn" title="" ref="2" id="details" href="javascript:void(0);">
<span class="ecke">
<span class="level">
<span class="textlabel">
**Kristallmine**
</span>
**1** </span>
</span>
</a>
</div>
</div>
</li>
Dear Community,
I want to create a bot for a browser game (just for learning purposes, of course). In the game you can build and level up metal and crystal mines to get more resources. To have the best resource proportions, it is best to keep the metal mine 2 levels higher than the crystal mine. Writing the code to compare the levels is no problem, but I'm having trouble accessing the actual "level" values of the mines, since they have no unique attribute.
In the code above you can see "Metallmine" and "Kristallmine" and the corresponding levels. I would like to write code similar to:
if LevelOfKristallmine - LevelOfMetallmine < -2:
    driver.find_element_by_whatever('upgradebutton').click()
How can I get the values of LevelOfKristallmine and LevelOfMetallmine?
Thanks a lot for your answers!

Are you trying to use the ID to get the values? Instead, copy the element's XPath and use something like:
driver.find_element_by_xpath('//*[@id="example-xpath"]/div/nav/ol').click()
To copy the XPath: press F12, find the element you want to click, right-click it, and choose Copy > XPath. Then paste it between the parentheses. Follow this other link and you should figure it out, mate.
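
As an alternative to a copied XPath, the levels can be read relative to the li ids (button1, button2), which are unique in the snippet. The sketch below is only an illustration: it parses a string standing in for Selenium's driver.page_source with Beautiful Soup, the helper name mine_level is made up, and the ** markers from the question are assumed to be emphasis rather than part of the page.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source (assumption: same structure as the
# snippet in the question, without the ** emphasis markers).
page_source = """
<li id="button1" class="on">
<a class="detail_button" ref="1" id="details" href="javascript:void(0);">
<span class="ecke"><span class="level">
<span class="textlabel"> Metallmine </span>
3 </span></span>
</a>
</li>
<li id="button2" class="on">
<a class="detail_button" ref="2" id="details" href="javascript:void(0);">
<span class="ecke"><span class="level">
<span class="textlabel"> Kristallmine </span>
1 </span></span>
</a>
</li>
"""


def mine_level(soup, button_id):
    """Return the integer level shown inside the given <li id=...>."""
    level_span = soup.select_one(f"li#{button_id} span.level")
    # The last non-blank text node of span.level is the number;
    # the first one is the label inside span.textlabel.
    return int(list(level_span.stripped_strings)[-1])


soup = BeautifulSoup(page_source, "html.parser")
metal = mine_level(soup, "button1")
crystal = mine_level(soup, "button2")
print(metal, crystal)
```

The two integers can then be compared as in the question before clicking the upgrade button through Selenium.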

How to scrape tags that appear within a script

My intention is to scrape the names of the top-selling products on Ali-Express.
I'm using the Requests library alongside Beautiful Soup to accomplish this.
import pprint as pp
import requests as req
from bs4 import BeautifulSoup as bs

url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
# pp.pprint(soup)  # verify that the page has been found
all_items = soup.find_all('li', class_='top10-item')
pp.pprint(all_items)
# []
However, this returns an empty list, indicating that soup.find_all() did not find any tags matching those criteria.
Inspect Element in Chrome displays the list items as ordinary li elements [screenshot not preserved].
However, in the page source the ul with class "top10-items" is empty and is followed by a script that seems to iterate through each list item (I'm not familiar with HTML).
<div class="container">
<div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<div class="rank-orders">
<span class="rank">{{rank}}</span>
<span class="orders">{{productOrderNum}}</span>
</div>
<div class="img-wrap">
<a href="{{productDetailUrl}}" target="_blank">
<img src="{{productImgUrl}}" alt="{{productName}}">
</a>
</div>
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
<p class="item-price">
<span class="price">US ${{productMinPrice}}</span>
<span class="uint">/ {{productUnitType}}</span>
</p>
</li>
{{/topList}}</script>
</div>
</div>
So this probably explains why soup.find_all() doesn't find the "li" tag.
My question is: How can I extract the item names from the script using Beautiful soup?
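
A note on what Beautiful Soup can and cannot see here, as a sketch under assumptions. The contents of a script tag are kept as a single text string rather than parsed into tags, which is why find_all() finds no li. The template text can be pulled out and parsed as its own document, but it only contains the {{...}} placeholders; the actual product names are presumably filled in by JavaScript at runtime and are not part of this HTML at all. The fragment below is a shortened copy of the source from the question.

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
</li>
{{/topList}}</script>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The <li> lives inside <script>, so it is text, not a tag in the tree.
direct_hits = soup.find_all("li", class_="top10-item")

# Pull the template text out and parse it as its own document.
script = soup.find("script", class_="X-template-top10")
template = BeautifulSoup(script.string, "html.parser")
placeholder = template.find("a", class_="item-desc").get_text()

print(direct_hits)
print(placeholder)
```

Getting the real names would mean either finding the data the page's JavaScript loads, or rendering the page in a browser (for example with Selenium) and parsing the resulting source; which of these applies to this site is not something the snippet above can tell.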

How to implement following-sibling axis of xpath alternative in Beautifulsoup Python

I'm trying to collect text using bs4, Selenium and Python. I want to get the text "Lisa Staprans" using:
name = str(profilePageSource.find(class_="hzi-font hzi-Man-Outline").div.get_text().encode("utf-8"))[2:-1]
Here is the HTML:
<div class="profile-about-right">
<div class="text-bold">
SF Peninsula Interior Design Firm
<br/>
Best of Houzz 2015
</div>
<br/>
<div class="page-tags" style="display:none">
page_type: pro_plus_profile
</div>
<div class="pro-info-horizontal-list text-m text-dt-s">
<div class="info-list-label">
<i class="hzi-font hzi-Ruler">
</i>
<div class="info-list-text">
<span class="hide" itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
<a href="http://www.houzz.com/professionals/c/Menlo-Park--CA" itemprop="url">
<span itemprop="title">
Professionals
</span>
</a>
</span>
<span itemprop="child" itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
<a href="http://www.houzz.com/professionals/interior-designer/c/Menlo-Park--CA" itemprop="url">
<span itemprop="title">
Interior Designers & Decorators
</span>
</a>
</span>
</div>
</div>
<div class="info-list-label">
<i class="hzi-font hzi-Man-Outline">
</i>
<div class="info-list-text">
<b>
Contact
</b>
: Lisa Staprans
</div>
</div>
</div>
</div>
Please let me know how it would be.

I assume you are using BeautifulSoup, since you are passing the class_ keyword argument.
If there is only one element with the class hzi-font hzi-Man-Outline, then try:
profilePageSource.find(class_="hzi-font hzi-Man-Outline").find_next('div').get_text().split(":")[-1].strip()
This extracts 'Lisa Staprans'.
Here find_next navigates to the next div after the matched element, and the part of its text after the colon is kept.

I can't test it right now, but I would do:
profilePageSource.find_element_by_class_name("info-list-text").get_attribute('innerHTML')
Note that this is a Selenium call, so profilePageSource would have to be the WebDriver rather than a BeautifulSoup object. Then you will have to split the result on the : (if it's always there).
For more information: https://selenium-python.readthedocs.org/en/latest/navigating.html

Maybe something is wrong with this part:
find(class_="hzi-font hzi-Man-Outline")
An easy way to get the right information can be: inspect the page in Google Chrome, right-click the element you need, copy its XPath, and then use:
profilePageSource.find_element_by_xpath(<xpath copied from Chrome>).text
Hope it helps.
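
For the following-sibling axis specifically, Beautiful Soup's closest counterpart is find_next_sibling(). A minimal sketch, assuming a shortened copy of the HTML from the question and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = """
<div class="info-list-label">
<i class="hzi-font hzi-Ruler">
</i>
<div class="info-list-text">
<span itemprop="title">Professionals</span>
</div>
</div>
<div class="info-list-label">
<i class="hzi-font hzi-Man-Outline">
</i>
<div class="info-list-text">
<b>
Contact
</b>
: Lisa Staprans
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match on one of the icon's classes; that is enough to single it out here.
icon = soup.find("i", class_="hzi-Man-Outline")

# Roughly XPath's following-sibling::div[1]: the next <div> on the same level.
text_div = icon.find_next_sibling("div")

# The div's text is "Contact : Lisa Staprans"; keep what follows the colon.
name = text_div.get_text().split(":")[-1].strip()
print(name)
```

The original attempt fails because the div is a sibling of the i tag, not a child of it, so .div on the icon returns None.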

Extract link in beautiful soup [duplicate]

I am new to beautiful soup and am trying to figure out how to pull a website from a nested array. The website can be found twice under the "track-visit-website" class.
This is NOT a duplicate of the question asking about how to pull hrefs. I've done that successfully on this page. I am trying to isolate the actual company website.
I've tried several approaches, but can't get any of them to work. Here is an example:
print(item.contents[2].find_all("a", {"class": "track-visit-website"})[0].a)
The site is YP.com Septic Search.
Here's the HTML for one of the items on the site:
<div class="info">
<h3 class="n">
<div class="info-section info-primary">
<p class="adr" itemprop="address" itemtype="http://schema.org/PostalAddress" itemscope="">
<span class="street-address" itemprop="streetAddress">2806 Farview Dr</span>
<span class="locality" itemprop="addressLocality">Fort Collins, </span>
<span itemprop="addressRegion">CO</span>
<span itemprop="postalCode">80524</span>
</p>
<div class="phones phone primary" itemprop="telephone">(970) 829-0852</div>
</div>
<div class="info-section info-secondary">
<div class="categories">
<div class="links">
<a class="track-visit-website" data-analytics="{"click_id":6,"act":2,"dku":"http://www.affordablesepticanddraincleaning.com","FL":"url","TL":"off","target":"website","LOC":"http://www.affordablesepticanddraincleaning.com"}" target="_blank" rel="nofollow" href="http://www.affordablesepticanddraincleaning.com" data-impressed="1">Website</a>
<a class="track-map-it directions" data-analytics="{"click_id":13,"target":"website","act":4}" href="/listings/1000775636908/directions" data-impressed="1">Directions</a>
<a class="track-more-info" data-analytics="{"click_id":7,"target":"moreInfo","act":1,"FL":"list"}" href="/fort-collins-co/mip/affordable-septic-drain-cleaning-llc-505109997?lid=1000775636908" data-impressed="1">More Info</a>
</div>

Copy this code snippet to a Python file and run it:
import re
content = """
<div class="info">
<h3 class="n">
<div class="info-section info-primary">
<p class="adr" itemprop="address" itemtype="http://schema.org/PostalAddress" itemscope="">
<span class="street-address" itemprop="streetAddress">2806 Farview Dr</span>
<span class="locality" itemprop="addressLocality">Fort Collins, </span>
<span itemprop="addressRegion">CO</span>
<span itemprop="postalCode">80524</span>
</p>
<div class="phones phone primary" itemprop="telephone">(970) 829-0852</div>
</div>
<div class="info-section info-secondary">
<div class="categories">
<div class="links">
<a class="track-visit-website" data-analytics="{"click_id":6,"act":2,"dku":"http://www.affordablesepticanddraincleaning.com","FL":"url","TL":"off","target":"website","LOC":"http://www.affordablesepticanddraincleaning.com"}" target="_blank" rel="nofollow" href="http://www.affordablesepticanddraincleaning.com" data-impressed="1">Website</a>
<a class="track-map-it directions" data-analytics="{"click_id":13,"target":"website","act":4}" href="/listings/1000775636908/directions" data-impressed="1">Directions</a>
<a class="track-more-info" data-analytics="{"click_id":7,"target":"moreInfo","act":1,"FL":"list"}" href="/fort-collins-co/mip/affordable-septic-drain-cleaning-llc-505109997?lid=1000775636908" data-impressed="1">More Info</a>
</div>
"""
websites = set(re.findall(r'http://[a-zA-Z0-9\.]*\.[a-z]{2,}',content)) # find all urls in the content
websites = list(websites)
print(websites) # or in python2 => print websites
Now find a way to incorporate that into your code: get the HTML, save it as content, run the regex over it and save the result to a file.
For web scraping it helps to know regex.
Read up on it; a good regex tutorial is a useful place to start.
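
For comparison, a sketch of the same extraction done with Beautiful Soup instead of a regex, assuming a shortened copy of the links block (the data-analytics attributes are left out for brevity). In the attempt from the question, find_all(...)[0] is already the a tag, so the trailing .a looks for a nested a inside it and returns None; the URL is in the tag's href attribute.

```python
from bs4 import BeautifulSoup

html = """
<div class="links">
<a class="track-visit-website" target="_blank" rel="nofollow" href="http://www.affordablesepticanddraincleaning.com">Website</a>
<a class="track-map-it directions" href="/listings/1000775636908/directions">Directions</a>
<a class="track-more-info" href="/fort-collins-co/mip/affordable-septic-drain-cleaning-llc-505109997?lid=1000775636908">More Info</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the "visit website" links and read their href attribute.
websites = [
    a["href"]
    for a in soup.find_all("a", class_="track-visit-website", href=True)
]
print(websites)
```

Unlike the regex, this only picks up the company website link and ignores other URLs that happen to appear elsewhere in the markup.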
