Web scraping with Python/Beautiful Soup

I am learning how to web scrape, and I am currently trying to find all of the social links in this code:
<ul class="socials">
<li class="social instagram">
<b>
Instagram:
</b>
<a href="https://www.instagram.com/keithgalli/">
https://www.instagram.com/keithgalli/
</a>
</li>
<li class="social twitter">
<b>
Twitter:
</b>
<a href="https://twitter.com/keithgalli">
https://twitter.com/keithgalli
</a>
</li>
<li class="social linkedin">
<b>
LinkedIn:
</b>
<a href="https://www.linkedin.com/in/keithgalli/">
https://www.linkedin.com/in/keithgalli/
</a>
</li>
<li class="social tiktok">
<b>
TikTok:
</b>
<a href="https://www.tiktok.com/#keithgalli">
https://www.tiktok.com/#keithgalli
</a>
</li>
</ul>
The links I want are clearly the ones in the anchor tags, but I am having issues with the find_all command: when I try to use it, I only get back one of the social links. The code I'm putting in is
href = soup.find_all("a")
print(href)
and the output is
[keithgalli.github.io/web-scraping/webpage.html]
I am not exactly sure what I am doing wrong. I thought that if I targeted the anchor tags, it would grab all of the hrefs. Any hints or direction would be greatly appreciated.

Try this:
for link in soup.find_all("a"):
    print(link.get("href"))
If that still prints only one link, look at what your output actually shows: the only anchor in the document you parsed is a link to webpage.html. That suggests the soup was built from the index page rather than from the page that contains the socials list, so find_all is working correctly on the HTML it was given.
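For reference, here is a minimal sketch that scopes the search to the socials list (the URL is an assumption based on the output shown above):
import requests
from bs4 import BeautifulSoup

# URL assumed from the output above; point this at the page with the socials
url = "https://keithgalli.github.io/web-scraping/webpage.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Limit the search to the <ul class="socials"> block, then pull each href
socials = soup.find("ul", class_="socials")
links = [a["href"] for a in socials.find_all("a")]
print(links)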

Related

How to scrape a URL using a LinkExtractor when it isn't a full URL?

I'm trying to scrape all the matches for this 2013 tennis tournament:
https://www.oddsportal.com/tennis/argentina/atp-buenos-aires-2013/results/
It has two pages and I'm trying to scrape both of them. However, the HTML doesn't seem to provide the full links:
<div id="pagination">
<a href="#/" x-page="1">
<span class="arrow">|«</span>
</a>
<a href="#/" x-page="1">
<span class="arrow">«</span>
</a>
<span class="active-page">1</span>
<a href="#/page/2/" x-page="2">
<span>2</span>
</a>
<a href="#/page/2/" x-page="2">
<span class="arrow">»</span>
</a>
<a href="#/page/2/" x-page="2">
<span class="arrow">»|</span>
</a>
</div>
When I hover over the link in Firefox I can see the full URL, so it's stored somewhere!
How would I go about configuring a LinkExtractor() to scrape both the pages?
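One thing worth noting before reaching for LinkExtractor: the hrefs here are fragments ("#/page/2/"), and fragments are never sent to the server, so there is no full URL in the HTML for an extractor to follow; the URL shown on hover is assembled by JavaScript. A minimal Scrapy sketch that at least recovers the page numbers from the x-page attribute (spider name and next step are assumptions):
import scrapy

class MatchesSpider(scrapy.Spider):
    name = "matches"
    start_urls = [
        "https://www.oddsportal.com/tennis/argentina/atp-buenos-aires-2013/results/"
    ]

    def parse(self, response):
        # The pagination anchors carry the page number in their x-page attribute
        pages = set(response.css("div#pagination a::attr(x-page)").getall())
        self.logger.info("pages found: %s", sorted(pages))
        # The rows for page 2 arrive via an XHR, so the next step would be to
        # replicate that request (look in the browser's network tab) or render
        # the page with a JavaScript-capable backend.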

How to find an href link inside a <li>

<div class="body>
<ul class = "graph">
<li>
<a href = "Address one"> Text1
</a>
</li>
<li>
<a href = "Address two"> Text2
</a>
</li>
<li>
<a href = "Address three"> Text3
</a>
</li>
</ul>
</div>
I am doing a web scraping project right now and I am having trouble extracting the href links above.
Right now I have
from bs4 import BeautifulSoup as soup
import requests
page = requests.get(url)
content = soup(page.content, "html.parser")
I tried using the find_all('a') and get('href') functions, but they don't seem to work in this situation.
Hope this helps:
for x in content.find_all('li'):
    href = x.find('a').get('href')
    print(href)
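For this snippet, every anchor lives inside one of the <li> items, so searching for the anchors directly gives the same result; a shorter equivalent:
hrefs = [a.get('href') for a in content.find_all('a')]
print(hrefs)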

Select specific tag on BS4 Python

I have the following HTML
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428007" class="product-size__option product-size__option--no-stock">
I DONT WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
I use this code to get the data
linksize = soup.find_all('li', class_='product-size__option-wrapper')
productsize = []
for size in linksize:
    for size_available in size.find_all('a', {'class': ['product-size__option']}):
        productsize.append(size_available.text.strip())
But it gets both tags, since they share the same class (product-size__option). How can I get only the information I need?
Thanks
The data you don't want has an extra CSS class, product-size__option--no-stock. You can check that the element does not carry this class with: if 'product-size__option--no-stock' not in size_available.attrs['class']
For example:
from bs4 import BeautifulSoup
html = '''<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428007" class="product-size__option product-size__option--no-stock">
I DONT WANT THIS</a>
</li>'''
soup = BeautifulSoup(html, 'html.parser')
linksize = soup.find_all('li', class_='product-size__option-wrapper')
productsize = []
for size in linksize:
    for size_available in size.find_all('a', {'class': ['product-size__option']}):
        # Skip any option whose class list also carries the no-stock modifier
        if 'product-size__option--no-stock' not in size_available.attrs['class']:
            productsize.append(size_available.text.strip())
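If you prefer CSS selectors, the same filter can be done in one pass, since BeautifulSoup's select() (via soupsieve) supports :not(); a sketch:
productsize = [
    a.get_text(strip=True)
    for a in soup.select('a.product-size__option:not(.product-size__option--no-stock)')
]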

Targeting the third list item with Beautiful Soup

I'm scraping a website with Beautiful Soup and am having trouble trying to target an item in a span tag nested within an li tag. The website I'm trying to scrape uses the same classes for each list item, which makes it harder. The HTML looks something like this:
<div class="bigger-container">
<div class="smaller-container">
<ul class="ulclass">
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item">**This is the only tag I want to scrape**</span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
</ul>
My first thought was to try to target it using nth-of-type(). I found a similar question here but it hasn't helped. I've been playing with it for a while now, but my code basically looks like this:
import requests
from bs4 import BeautifulSoup

url = "url of website I'm scraping"
headers = {User-Agent Header}

for page in range(1):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, features="lxml")
    scrape = soup.find_all('div', class_='even_bigger_container_not_included_in_html_above')
    for item in scrape:
        condition = soup.find('li:nth-of-type(2)', 'span:nth-of-type(1)').text
        print(condition)
Any help is greatly appreciated!
To use a CSS Selector, use the select() method, not find().
So to get the third <li>, use li:nth-of-type(3) as a CSS Selector:
from bs4 import BeautifulSoup
html = """<div class="bigger-container">
<div class="smaller-container">
<ul class="ulclass">
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item">**This is the only tag I want to scrape**</span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("li:nth-of-type(3)").get_text(strip=True))
Output:
**This is the only tag I want to scrape**
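Note that select_one("li:nth-of-type(3)") grabs the whole third <li>; that works here only because the description spans are empty. If they weren't, you could point the selector at the span itself, e.g. soup.select_one("li:nth-of-type(3) span.item") (a sketch based on the HTML above).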

Extracting data with BeautifulSoup

This is probably an easy question, but I can't figure it out.
I'm having trouble extracting the email and URL from this part of a webpage with BeautifulSoup:
<!-- ENDE telefonnummer.jsp --></li>
<li class="email ">
<a
class="link"
href="mailto:info#taxi-ac.de"
data-role="email-layer"
data-template-replacements='{
"name": "Aachener-Airport-Taxi Blum",
"subscriberId": "128027562762",
"captchaBase64": "data:image/jpg;base64,/9j/4AAQSkZJRgABAgAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAAvAG4DASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD02iivPLm58L6d4x1i4nsdLGpRNCsUcsSIFcASG5eUjEQLXCqWPJMfy72IWvyDD4d13JK90r6K/VLurb7nuzlynodcfqvxJ0PTda/se3ivtU1AMyPBp0HmsjKMkHJGTjOducbTnGK1rDw7awanJrF7FaXOsy4DXcdsI9oClQEBLEcE5JYk5xnaFVfGXTxP8JPElxqVxbQ6jaXshja8lGTOMhz8+S0bnng5BIJw+0GvRy3A4fETnHm5pJe6r8vM+uuui+TflqY1akoJO1l9567oPjKz13VJtMOnapp17HCLgQ6hbeUzx7tpZeTwDgc468ZwcM/4WD4ZOp/2bHfTTXh+7DDZzyFxt3ArtQ7lK/MCMgjkcVZ8LeLdK8X6e93pkjgxttlglAWSM9twBPBAyCCR17ggeWaHfw3fx11nU9QS53WTTrEtnbSTZKYgG5UVmxsySeBux06VVDAU6s6yqQlHkjeyet+2q2YSqOKjZp3Z61pnibSdX1C50+0uX+22yh5baeCSGRVPQ7XUEjkdPUeorz/xt8SPEPgzxWthJbaXdWUircR7UkSQxFiNpO4gN8pGcEdDjsKvhy1k8U/GOfxdpjQtpEWcu0yeYf3JhH7sEuu4gsNwXKjPXiuw1zw/Z+J9Y13S7xEIk0y0MUjLuMMm+62uORyCfUZGQeCa1jRwmDxKVVc0eVOSe8W2k1p1W/Tt5icp1Ie7o76eZu2epf294ft9R0a4hj+1RrJE80fmhPVWVXHzDlSA3BHfGKg8N3ep31pcT6jPaSbbmaCNbe3aLHlSvGSdztnO0HHGOnNeH6Dr2ufCfxPNpWqwvJYOwaaBTlXU8CaInHOB7ZxtOCPl9t8IzRXGhPPBIksUl/eukiMGVlN1KQQR1BFY5jl7wcG4tShJrllptrpf7v6uOlV9o9dGtzdooorxToCiiigCG7uo7K2e4lWZkTGRDC8rcnHCoCx69hWF4ZeHUNN1GG5s7ndNd3DTre2ckfnRvK4jz5ijePKCLjnChQccCujorWNRRpuNtW1rft/XfsJq7uclot5qem+EDY2+mTX2p6V/oiQtG1otwiSGNHV5AVOY1DnBI5xxkVn6j45vL7T57Ow8Ea/NdXC+THHqFhstyW4/eHcflwec4B6EjqO9orojiqXO5zpptu+7Xy06fc/MhwdrJnmXgjw7efDnwvqV9f2013qd5tKWdmrzfdQlEOxDtYsWBblR8vPrmfCS3k8LWWpy6vZavb3F1JGqwf2TcPhUBw25UI5LkY7bfevYKK6Z5rKrGqqsbuo1dp2+HZLRkqik1boeWfDvw7q6eOtc8UXdhNY2N95/kR3Q2THfMGGU5K4C85x1GMjmupstYgk8X3c4tdUWK5tLWCOR9MuUUusk5YEmMbQBIvJwOevBrqqKxxGO+sTlOpHdJKz2S+++w40+RJJnK+PfBsXjPQRarIkN7A3mW0zKCA2MFWOMhW4zjuFPOME+HGnXmk+A9PsL+3eC6gaZZI36g+c/5gjkEcEEEV1VFYvGVXhvqr+FO68t/wDMr2a5+fqFFFFcpYUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFAH//Z",
"captchaWidth": "110",
"captchaHeight": "47",
"captchaEncryptedAnswer": "767338fffffff8ffffffd6ffffff8d3038ffffffba1971ffffffdfffffffe3f6c9"
}'
data-wipe='{"listener":"click","name":"Detailseite E-Mail","id":"128027562762"}'
>
<i class="icon-mail"></i>
<span class="text" >info#taxi-ac.de</span>
</a>
</li>
<li class="website ">
<a class="link" href="http://www.aachener-airport-taxi.de" rel="follow" target="_blank" title="http://www.aachener-airport-taxi.de"
data-wipe='{"listener":"click","name":"Detailseite Webadresse","id":"128027562762"}'>
<i class="icon-website"></i>
<span class="text">Zur Website</span>
</a>
</li>
</ul>
</div>
I'm trying to get info#taxi-ac.de and http://www.aachener-airport-taxi.de out of there.
soup.find(class='email') obviously doesn't work, because class is a reserved keyword in Python (BeautifulSoup expects class_ instead). While I can use
for link in soup.find_all('a'):
    print(link.get('href'))
to get ALL the links in there, I want this specific one. The links are always different, so I can't regex for them, so I guess one would have to navigate through the HTML body by hand.
print(soup.find("span",{"class":"text"}).text)
print(soup.find(attrs={"class":"website"}).a["href"])
info#taxi-ac.de
http://www.aachener-airport-taxi.de
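If you'd rather take the email from the mailto: href than from the visible text, a small sketch (selector assumed from the HTML above; removeprefix needs Python 3.9+):
# Take the href of the anchor inside <li class="email"> and strip the scheme
email_link = soup.select_one("li.email a.link")["href"]
print(email_link.removeprefix("mailto:"))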
