I am trying to access the elements in the Ingredients list of the following website: https://www.jamieoliver.com/recipes/pasta-recipes/gennaro-s-classic-spaghetti-carbonara/
<div class="col-md-12 ingredient-wrapper">
<ul class="ingred-list ">
<li>
3 large free-range egg yolks
</li>
<li>
40 g Parmesan cheese, plus extra to serve
</li>
<li>
1 x 150 g piece of higher-welfare pancetta
</li>
<li>
200g dried spaghetti
</li>
<li>
1 clove of garlic
</li>
<li>
extra virgin olive oil
</li>
</ul>
</div>
I first tried just using requests and BeautifulSoup, but my code didn't find the list elements. I then tried using Selenium and it still didn't work. My code is below:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.jamieoliver.com/recipes/pasta-recipes/cracker-ravioli/"
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
for ultag in soup.findAll('div', {'class': "col-md-12 ingredient-wrapper"}):
    # for ultag in soup.findAll('ul', {'class': 'ingred_list '}):
    for litag in ultag.find_all('li'):
        print(litag.text)
To get the ingredients list, you can use this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.jamieoliver.com/recipes/pasta-recipes/gennaro-s-classic-spaghetti-carbonara/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for li in soup.select('.ingred-list li'):
    print(' '.join(li.text.split()))
Prints:
3 large free-range egg yolks
40 g Parmesan cheese , plus extra to serve
1 x 150 g piece of higher-welfare pancetta
200 g dried spaghetti
1 clove of garlic
extra virgin olive oil
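The `' '.join(li.text.split())` trick above collapses the stray newlines and runs of spaces inside each `<li>` into single spaces. A minimal offline sketch, using an inline HTML fragment standing in for the page:

```python
from bs4 import BeautifulSoup

# inline HTML fragment mimicking the messy whitespace inside the real <li> tags
html = """<ul class="ingred-list">
<li>
 200 g
 dried spaghetti
</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
# .split() breaks on any whitespace run; ' '.join re-joins with single spaces
items = [" ".join(li.text.split()) for li in soup.select(".ingred-list li")]
print(items)  # ['200 g dried spaghetti']
```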
I am having trouble with getting the tag from a BeautifulSoup .find() method.
Here is my code:
url = evaluations['href']
page = requests.get(url, headers = HEADERS)
soup = BeautifulSoup(page.content, 'lxml')
evaluators = soup.find("section", class_="main-content list-content")
evaluators_list = evaluators.find("ul", class_='evaluation-list').find_all("li")
evaluators_dict = defaultdict(dict)
for evaluator in evaluators_list:
    eval_list = evaluator.find('ul', class_='highlights-list')
    print(eval_list.prettify())
This then gives the output:
<ul class="highlights-list">
<li class="eval-meta evaluator">
<b class="uppercase heading">
Evaluated By
</b>
<img alt="Andrew Ivins" height="50" src="https://s3media.247sports.com/Uploads/Assets/680/358/9358680.jpeg?fit=bounds&crop=50:50,offset-y0.50&width=50&height=50&fit=crop" title="Andrew Ivins" width="50"/>
<div class="evaluator">
<b class="text">
Andrew Ivins
</b>
<span class="uppercase">
Southeast Recruiting Analyst
</span>
</div>
</li>
<li class="eval-meta projection">
<b class="uppercase heading">
Projection
</b>
<b class="text">
First Round
</b>
</li>
<li class="eval-meta">
<b class="uppercase heading">
Comparison
</b>
<a href="https://247sports.com/Player/Charles-Woodson-76747/" target="_blank">
Charles Woodson
</a>
<span class="uppercase">
Oakland Raiders
</span>
</li>
</ul>
and the error
Traceback (most recent call last):
File "XXX", line 2, in <module>
player = Player("Travis-Hunter-46084728").player
File "XXX", line 218, in __init__
self.player = self._parse_player()
File "XXX", line 253, in _parse_player
evaluators, background, skills = self._find_scouting_report(soup)
File "XXX", line 468, in _find_scouting_report
print(eval_list.prettify())
AttributeError: 'NoneType' object has no attribute 'prettify'
As you can see, it does find the tag and prints it prettified, but it also outputs a None. What can be a way around this? Thank you in advance. The link I am using is: https://247sports.com/PlayerInstitution/Travis-Hunter-at-Collins-Hill-236028/PlayerInstitutionEvaluations/
EDIT: I have used selenium thinking it may be a JS problem but that did not resolve either.
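The None most likely comes from `<li>` entries in the evaluation list that contain no nested `highlights-list` at all, so `.find()` returns None for them. A minimal sketch of the guard, using inline HTML with made-up content:

```python
from bs4 import BeautifulSoup

# inline HTML: the second <li> has no highlights-list, mimicking the page
html = """<ul class="evaluation-list">
<li><ul class="highlights-list"><li>Projection</li></ul></li>
<li>Evaluation metadata only</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
found = []
for evaluator in soup.select("ul.evaluation-list > li"):
    eval_list = evaluator.find("ul", class_="highlights-list")
    if eval_list is None:
        continue  # skip entries without a nested highlights-list
    found.append(eval_list)
print(len(found))  # 1
```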
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0'
}

def get_soup(content):
    return BeautifulSoup(content, 'lxml')

def main(url):
    with requests.Session() as req:
        req.headers.update(headers)
        r = req.get(url)
        soup = get_soup(r.content)
        goal = [list(x.stripped_strings) for x in soup.select(
            '.main-content.list-content > .evaluation-list > li > .highlights-list')]
        for i in goal:
            print(i[1:3] + i[-2:])

if __name__ == "__main__":
    main('https://247sports.com/PlayerInstitution/Travis-Hunter-at-Collins-Hill-236028/PlayerInstitutionEvaluations/')
Output:
['Andrew Ivins', 'Southeast Recruiting Analyst', 'Charles Woodson', 'Oakland Raiders']
['Andrew Ivins', 'Southeast Recruiting Analyst', 'Xavier Rhodes', 'Minnesota Vikings']
['Charles Power', 'National writer', 'Marcus Peters', 'Baltimore Ravens']
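`stripped_strings`, used in the list comprehension above, yields every text fragment in the subtree with surrounding whitespace removed (skipping whitespace-only nodes), which is why the slicing `i[1:3] + i[-2:]` lines up. A small offline sketch on inline HTML:

```python
from bs4 import BeautifulSoup

# inline HTML loosely modeled on one highlights-list entry
html = """<ul class="highlights-list">
<li><b>Evaluated By</b> <b> Andrew Ivins </b></li>
<li><span> Southeast Recruiting Analyst </span></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
# whitespace-only strings are dropped; the rest are stripped
strings = list(soup.find("ul").stripped_strings)
print(strings)  # ['Evaluated By', 'Andrew Ivins', 'Southeast Recruiting Analyst']
```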
I am new to Selenium, Python, and programming in general, but I am trying to write a small web scraper. I have encountered a website that has multiple links, but their HTML code is not available to me using
soup = bs4.BeautifulSoup(html, "lxml")
The HTML-Code is:
<div class="content">
<div class="vertical_page_list is-detailed">
<div infinite-nodes="true" up-data="{"next":1,"url":"/de/pressemitteilungen?container_contenxt=lg%2C1.0"}">[event]
<ul class="has-no-bottom-margin list-unstyled infinite-nodes--list">
<li class="vertical-page-list--item is-detailed infite-nodes--list-item" style="display: list-item;">
<li class="...>
...
</ul>
</div>
</div>
</div>
But soup only contains this part, missing the li classes:
<div class="content">
<div class="vertical_page_list is-detailed">
<div infinite-nodes="true" up-data="{"next":1,"url":"/de/pressemitteilungen?container_contenxt=lg%2C1.0"}">
<ul class="has-no-bottom-margin list-unstyled infinite-nodes--list">
</ul>
</div>
</div>
</div>
It has something to do with the [event] after the div, but I can't figure out what to do. My guess was that it is some lazy-loaded code, but using
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
or directly moving to the element
actions = ActionChains(driver)
actions.move_to_element(driver.find_element_by_xpath("//div['infinite-nodes=']")).perform()
did not yield any results. This is the code I am using:
# Enable headless Firefox for Selenium
options = Options()
#options.headless = True
options.add_argument("--headless")
options.page_load_strategy = 'normal'
driver = webdriver.Firefox(options=options, executable_path=r'C:\bin\geckodriver.exe')
print("Headless Firefox Initialized")
# Load html source code from webpage
driver = webdriver.PhantomJS(executable_path=r'C:\phantomjs\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get("https://www.volkswagen-newsroom.com/de/pressemitteilungen?container_context=lg%2C1.0")
SCROLL_PAUSE_TIME = 2
# Scroll down to bottom
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait to load page
time.sleep(SCROLL_PAUSE_TIME)
print("Scrolled down to bottom")
# Extract html code
driver.find_element_by_xpath("//div['infinite-nodes=']").click() #just testing
time.sleep(SCROLL_PAUSE_TIME)
html = driver.page_source.encode('utf-8')
soup = bs4.BeautifulSoup(html, "lxml")
Could anyone help me please?
When you visit the page in a browser, and log your network traffic, every time the page loads (or you press the Mehr Pressemitteilungen anzeigen button) an XHR (XmlHttpRequest) request is made to some kind of API(?) - the response of which is JSON, which also contains HTML. It's this HTML that contains the list-item elements you're looking for. You don't need selenium for this:
def get_article_titles():
    import requests
    from bs4 import BeautifulSoup as Soup
    url = "https://www.volkswagen-newsroom.com/de/pressemitteilungen"
    params = {
        "container_context": "lg,1.0",
        "next": "1"
    }
    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "user-agent": "Mozilla/5.0",
        "x-requested-with": "XMLHttpRequest"
    }
    while True:
        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()
        data = response.json()
        params["next"] = data["next"]
        soup = Soup(data["html"], "html.parser")
        for tag in soup.select("h3.page-preview--title > a"):
            yield tag.get_text().strip()

def main():
    from itertools import islice
    for num, title in enumerate(islice(get_article_titles(), 10), start=1):
        print("{}.) {}".format(num, title))
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
1.) Volkswagen Konzern, BASF, Daimler AG und Fairphone starten Partnerschaft für nachhaltigen Lithiumabbau in Chile
2.) Verkehrsausschuss-Vorsitzender Cem Özdemir informiert sich über Transformation im Elektro-Werk in Zwickau
3.) Astypalea: Start der Transformation zur smarten, nachhaltigen Insel
4.) Vor 60 Jahren: Fußball-Legende Pelé zu Besuch im Volkswagen Werk Wolfsburg
5.) Novum unter den Kompakten: Neuer Polo ist mit „IQ.DRIVE Travel Assist“ teilautomatisiert unterwegs
6.) Der neue Tiguan Allspace – ab sofort bestellbar
7.) Volkswagen startet Vertriebsoffensive im deutschen Markt
8.) Vor 70 Jahren: Volkswagen erhält ersten Beirat
9.) „Experience our Volkswagen Way to Zero“ – neue Ausstellung im DRIVE. Volkswagen Group Forum für Gäste geöffnet
10.) Jetzt bestellbar: Der neue ID.4 GTX
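One detail worth noting in the answer above: `get_article_titles()` is an endless generator (the `while True` keeps requesting the next page of the API), so `itertools.islice` is what stops the iteration after ten titles. The same mechanic in isolation:

```python
from itertools import islice

def pages():
    # endless generator, like the paginated API loop above
    n = 1
    while True:
        yield n
        n += 1

# islice caps consumption without the generator ever terminating itself
first_five = list(islice(pages(), 5))
print(first_five)  # [1, 2, 3, 4, 5]
```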
<ul id="name1">
<li>
<li>
<li>
<li>
</ul>
<ul id="name2">
<li>
<li>
<li>
</ul>
Hi. I am scraping a website. There are different numbers of li tags under the different ul tag names. I think there is something wrong with my method. I want your help.
NOTE:
The code at the bottom corresponds to the image I took from the site
ts1 = brosursoup.find("div", attrs={"id": "name"})
ts2 = ts1.find("ul")
hesap = 0
count2 = len(ts1.find_all('ul'))
if (hesap <= count2):
    hesap = hesap + 1
    for qwe in ts1.find_all("ul", attrs={"id": f"name{hesap}"}):
        for bnm in ts1.find_all("li"):
            for klo in ts1.find_all("div"):
                tgf = ts1.find("span", attrs={"class": "img_w_v8"})
                for abn in tgf.find_all("img"):
                    picture = abn.get("src")
                    picturename = abn.get("title")
                    print(picture + " ------ " + picturename)
You can just find which ul tag you want and then use find_all.
page = '''<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>'''
soup = BeautifulSoup(page,'lxml')
ul_tag = soup.find('ul', {'id': 'name2'})
li_tags = ul_tag.find_all('li')
for i in li_tags:
    print(i.text)
# output
li5
li6
li7
If you are trying to match all ul elements of the form id='nameXXX' then you can use a regular expression as follows:
from bs4 import BeautifulSoup
import re
page = '''<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>'''
soup = BeautifulSoup(page, 'lxml')
for ul in soup.find_all('ul', {'id': re.compile(r'name\d+')}):
    for li in ul.find_all('li'):
        print(li.text)
This would display:
li1
li2
li3
li4
li5
li6
li7
Try this
from bs4 import BeautifulSoup
page = """<ul id="name1">
<li>li1</li>
<li>li2</li>
<li>li3</li>
<li>li4</li>
</ul>
<ul id="name2">
<li>li5</li>
<li>li6</li>
<li>li7</li>
</ul>"""
soup = BeautifulSoup(page,'lxml')
ul_tag = soup.find_all('ul', {"id": ["name1", "name2"]})
for i in ul_tag:
    print(i.text)
I want to fetch the number 121 from the code below, but the soup object that I am getting does not show the number.
Link to my Image
[<div class="open_pln" id="pln_1">
<ul>
<li>
<div class="box_check_txt">
<input id="cp1" name="cp1" onclick="change_plan(2,102,2);" type="checkbox"/>
<label for="cp1"><span class="green"></span></label>
</div>
</li>
<li id="li_open"><span>Desk</span> <br/></li>
<li> </li>
</ul>
</div>]
The number 121 for open offices is not inside HTML code, but in the JavaScript. You can use regex to extract it:
import re
import requests
url ='https://www.coworker.com/search/los-angeles/ca/united-states'
htmlpage = requests.get(url).text
open_offices = re.findall(r'var openOffices\s*=\s*(\d+)', htmlpage)[0]
private_offices = re.findall(r'var privateOffices\s*=\s*(\d+)', htmlpage)[0]
print('Open offices: {}'.format(open_offices))
print('Private offices: {}'.format(private_offices))
Prints:
Open offices: 121
Private offices: 40
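The regex can be exercised offline against an inline snippet standing in for the page source (the snippet and its values are made up for illustration):

```python
import re

# hypothetical inline stand-in for the page's embedded JavaScript
html = """<script>
var openOffices  = 121;
var privateOffices = 40;
</script>"""
# \s* tolerates any spacing around '='; (\d+) captures the number
open_offices = re.findall(r'var openOffices\s*=\s*(\d+)', html)[0]
private_offices = re.findall(r'var privateOffices\s*=\s*(\d+)', html)[0]
print(open_offices, private_offices)  # 121 40
```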
Without re module:
import requests
from bs4 import BeautifulSoup
url ='https://www.coworker.com/search/los-angeles/ca/united-states'
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
searchstr = "var openOffices = "
script = soup.select_one(f"script:contains('{searchstr}')").text
print(script.split(searchstr)[1].split(";")[0])
Output:
121
You have to find all the li elements using soup, like this:
all_links = soup.find_all("li")
for link in all_links:
    print(link.text.strip())
I can't figure out how to get the title on the anchor.
Here is my code:
from flask import Flask
import requests
from bs4 import BeautifulSoup
laptops = 'http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops'
def scrape():
    page = requests.get('http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops')
    soup = BeautifulSoup(page.content, "lxml")
    links = soup("a", {"class": "title"})
    for link in links:
        print(link.prettify())

scrape()
Example of result:
<a class="title" href="/test-sites/e-commerce/allinone/product/251" title="Asus VivoBook X441NA-GA190">
Asus VivoBook X4...
</a>
<a class="title" href="/test-sites/e-commerce/allinone/product/252" title="Prestigio SmartBook 133S Dark Grey">
Prestigio SmartB...
</a>
<a class="title" href="/test-sites/e-commerce/allinone/product/253" title="Prestigio SmartBook 133S Gold">
Prestigio SmartB...
</a>
How do I get the "title"?
Attributes like title are accessible via subscription or the .attrs dictionary on an element:
for link in links:
    print(link['title'])
See the BeautifulSoup documentation on Attributes.
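A short sketch of the access routes on an inline tag (HTML fragment copied from the output above):

```python
from bs4 import BeautifulSoup

html = '<a class="title" href="/test-sites/e-commerce/allinone/product/251" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", class_="title")
print(link["title"])          # subscription -- raises KeyError if the attribute is absent
print(link.get("title"))      # .get() -- returns None instead of raising
print("title" in link.attrs)  # .attrs is a plain dict of the tag's attributes
```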
For the given URL this produces:
Asus VivoBook X441NA-GA190
Prestigio SmartBook 133S Dark Grey
Prestigio SmartBook 133S Gold
Aspire E1-510
Lenovo V110-15IAP
Lenovo V110-15IAP
Hewlett Packard 250 G6 Dark Ash Silver
# ... etc