I'm using the following code to retrieve all the image links on a webpage
from bs4 import BeautifulSoup
import requests
import re

def get_txt(soup, key):
    key_tag = soup.find('span', text=re.compile(key)).parent
    return key_tag.find_all('span')[1].text

urldes = "https://www.johnpyeauctions.co.uk/lot_list.asp?saleid=4709&siteid=1"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

r = requests.get(urldes, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

image_links = [x['data-img'] for x in soup.find_all('a', rel='popover')]
for link in image_links:
    print(link)
I would like to apply the same principle in order to retrieve the text description that goes with each image:
soup.find_all(width='41%')

for text in soup.find_all('h5'):
    print(text)
This code retrieves all the <h5> tags, but not just the ones inside the parent tag with width='41%'.
I have tried to apply the same loop as above for the image links:
image_text = [x['h5'] for x in soup.find_all(width='41%')]
for text in image_text:
    print(text)
But I get the following error:
`Traceback (most recent call last):
File "C:\Users\alexa\Desktop\jpye_v2.py", line 41, in <module>
image_text = [x['h5'] for x in soup.find_all(width='41%')]
File "C:\Users\alexa\Desktop\jpye_v2.py", line 41, in <listcomp>
image_text = [x['h5'] for x in soup.find_all(width='41%')]
File "C:\Python36\lib\site-packages\beautifulsoup4-4.6.0-py3.6.egg\bs4\element.py", line 1011, in __getitem__
return self.attrs[key]
KeyError: 'h5'`
What I don't understand is why the h5 tag gives an error when the a tag does not. Or can I not use the same kind of loop to index the text the way I did for the image links?
First of all, writing the line soup.find_all(width='41%') on its own doesn't do anything. The find_all() method returns a list of all the matching tags, so you'll have to store that in a variable first and then iterate over it.
For your second snippet, tag['attribute'] is used to get the value of an attribute of the tag. So, using x['h5'] raises a KeyError, since h5 is not an attribute but a tag.
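To see the difference on a tiny, made-up fragment: square brackets look up attributes (and .get() is the KeyError-free variant), while a nested tag like h5 is reached with find():

```python
from bs4 import BeautifulSoup

html = '<td width="41%"><h5>Sample description</h5></td>'
soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td')

print(td['width'])         # 'width' is an attribute, so bracket access works: 41%
print(td.find('h5').text)  # 'h5' is a nested tag, reached via find(): Sample description
print(td.get('h5'))        # .get() returns None instead of raising KeyError
```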
Finally, to get the text that you want, you can use this:
for tag in soup.find_all('td', width='41%'):
    image_text = tag.find('h5').text
    print(image_text)
Or, to show how the find_all() method works, you can check this:
tags = soup.find_all('td', width='41%')
for tag in tags:
    image_text = tag.find('h5').text
    print(image_text)
Partial Output:
GUESS C0001G1 GENTS ROSE GOLD TONE AND BLUE BRUSH POLISHED SMART WATCH WITH VOICE COMMAND. FITTED WITH A BLUE SMOOTH SILICONE STRAP.BOXED AND PAPERS. RRP £259.00
GUESS I14503L1 LADIES SPORT WATCH WITH POLISHED SILVER COLOUR CASE WITH CRYSTALS, SILVER DIAL AND POLISHED SILVER COLOUR BRACELET. RRP £159
GUESS W0111L2 LADIES WATCH. POLISHED GOLD COLOUR CASE WITH CRYSTALS AND GOLD COLOUR MULTI-FUNCTION DIAL AND BRACELET. RRP £189
GUESS W0072L3 LADIES TREND WATCH. POLISHED ROSE GOLD CASE WITH CRYSTALS AND ROSE GOLD DIAL. POLISHED ROSE GOLD MULTI-CHAIN BRACELET WITH ADJUSTING G-LINK. RRP £159
GUESS W0330L2 LADIES SPORT WATCH. POLISHED ROSE GOLD COLOUR CASE WITH ROSE GOLD COLOUR CHRONO LOOK MULTI FUNCTION DIAL AND ROSE GOLD COLOUR BRACELET. RRP £169
GUESS W13573L1 LADIES SPORT WATCH. POLISHED GOLD COLOURED CASE WITH CRYSTAL AND WHITE MULTI FUNCTION DAIL AND POLISHED GOLD COLOURED BRACELET. RRP £169
GUESS W0674G6 MENS WATCH. ROSE GOLD CASE WITH BLACK TRIM AND SUN/ BLACK MULTI FUNCTION DIAL AND BLACK CROCODILE STYLE LEATHER BRACELET. RRP £169
GUESS W0564L1 LADIES SPORT WATCH. ROSE GOLD COLOUR CASING WITH BLUE TRIM AND CRYSTALS, WHITE MULTI FUNCTION DIAL WITH SMOOTH SILICONE STRAP. RRP £149
GUESS W0425L3 LADIES SPORT WATCH. POLISHED ROSE GOLD/ ANIMAL PRINT CASE AND SUN ROSE GOLD AND ANIMAL DAIL WITH POLISHES ROSE GOLD AND ANIMAL PRINT BRACELET. RRP £189
...
width=41% is an attribute. This will get you closer to what you want:
for text in soup.find_all('td', {'width': '41%'}):
    print(text)
I'm trying to get the chip names from this Target market link, and to get all 28 chips on the first page automatically. I wrote the code below. It opens the link, scrolls down (to fetch the names and pictures) and tries to get the names:
import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager as CM
options = webdriver.ChromeOptions()
options.add_argument("--log-level=3")
mobile_emulation = {
    "userAgent": 'Mozilla/5.0 (Linux; Android 4.0.3; HTC One X Build/IML74K) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/83.0.1025.133 Mobile Safari/535.19'
}
options.add_experimental_option("mobileEmulation", mobile_emulation)
bot = webdriver.Chrome(executable_path=CM().install(), options=options)
bot.get('https://www.target.com/c/chips-snacks-grocery/-/N-5xsy7')
bot.set_window_size(500, 950)
time.sleep(5)
for i in range(0, 3):
    ActionChains(bot).send_keys(Keys.END).perform()
    time.sleep(1)

product_names = bot.find_elements_by_class_name('Link-sc-1khjl8b-0 styles__StyledTitleLink-mkgs8k-5 kdCHb inccCG h-display-block h-text-bold h-text-bs flex-grow-one')
hrefList = []
for e in product_names:
    hrefList.append(e.get_attribute('href'))
for href in hrefList:
    print(href)
When I inspect the names in the browser, the common part of all the chips is the class name Link-sc-1khjl8b-0 styles__StyledTitleLink-mkgs8k-5 kdCHb inccCG h-display-block h-text-bold h-text-bs flex-grow-one. So, as you see, I added the find_elements_by_class_name('Link-sc-1khjl8b-0 styles__StyledTitleLink-mkgs8k-5 kdCHb inccCG h-display-block h-text-bold h-text-bs flex-grow-one') line. But it gives an empty result. What is wrong? Can you help me? The solution can use Selenium or bs4, it doesn't matter.
You can get all that data from the API, as long as you feed it the correct key.
import requests
url = 'https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1'
payload = {
    'key': 'ff457966e64d5e877fdbad070f276d18ecec4a01',
    'category': '5xsy7',
    'channel': 'WEB',
    'count': '28',
    'default_purchasability_filter': 'true',
    'include_sponsored': 'true',
    'offset': '0',
    'page': '/c/5xsy7',
    'platform': 'desktop',
    'pricing_store_id': '1771',
    'scheduled_delivery_store_id': '1771',
    'store_ids': '1771,1768,1113,3374,1792',
    'useragent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'visitor_id': '0179C80AE1090201B5D5C1D895ADEA6C'}
jsonData = requests.get(url, params=payload).json()
for each in jsonData['data']['search']['products']:
    title = each['item']['product_description']['title']
    buy_url = each['item']['enrichment']['buy_url']
    image_url = each['item']['enrichment']['images']['primary_image_url']
    print(title)
Output:
Ruffles Cheddar & Sour Cream Potato Chips - 2.5oz
Doritos 3D Crunch Chili Cheese Nacho - 6oz
Hippeas Vegan White Cheddar Organic Chickpea Puffs - 5oz
PopCorners Spicy Queso - 7oz
Doritos 3D Crunch Spicy Ranch - 6oz
Pringles Snack Stacks Variety Pack Potato Crisps Chips - 12.9oz/18ct
Frito-Lay Variety Pack Flavor Mix - 18ct
Doritos Nacho Cheese Chips - 9.75oz
Hippeas Nacho Vibes Organic Chickpea Puffs - 5oz
Tostitos Scoops Tortilla Chips -10oz
Ripple Potato Chips Party Size - 13.5oz - Market Pantry™
Ritz Crisp & Thins Cream Cheese & Onion Potato And Wheat Chips - 7.1oz
Pringles Sour Cream & Onion Potato Crisps Chips - 5.5oz
Original Potato Chips Party Size - 15.25oz - Market Pantry™
Organic White Corn Tortilla Chips - 12oz - Good & Gather™
Sensible Portions Sea Salt Garden Veggie Straws - 7oz
Traditional Kettle Chips - 8oz - Good & Gather™
Lay's Classic Potato Chips - 8oz
Cheetos Crunchy Flamin Hot - 8.5oz
Sweet Potato Kettle Chips - 7oz - Good & Gather™
SunChips Harvest Cheddar Flavored Wholegrain Snacks - 7oz
Frito-Lay Variety Pack Classic Mix - 18ct
Doritos Cool Ranch Chips - 10.5oz
Lay's Wavy Original Potato Chips - 7.75oz
Frito-Lay Variety Pack Family Fun Mix - 18ct
Cheetos Jumbo Puffs - 8.5oz
Frito-Lay Fun Times Mix Variety Pack - 28ct
Doritos Nacho Cheese Flavored Tortilla Chips - 15.5oz
Lay's Barbecue Flavored Potato Chips - 7.75oz
SunChips Garden Salsa Flavored Wholegrain Snacks - 7oz
Pringles Snack Stacks Variety Pack Potato Crisps Chips - 12.9oz/18ct
Frito-Lay Variety Pack Doritos & Cheetos Mix - 18ct
This also works:
product_names = bot.find_elements_by_xpath("//li[@data-test='list-entry-product-card']")
hrefList = []
for e in product_names:
    print(e.find_element_by_css_selector("a").get_attribute("href"))
Try instead
product_names = bot.find_elements_by_css_selector('.Link-sc-1khjl8b-0.styles__StyledTitleLink-mkgs8k-5.kdCHb.inccCG.h-display-block.h-text-bold.h-text-bs.flex-grow-one')
When using find_elements_by_class_name(), spaces in the class name are not handled properly: the method expects a single class name, so a space-separated list won't match anything.
Except that selector doesn't work for me, I need to use '.Link-sc-1khjl8b-0.ItemLink-sc-1eyz3ng-0.kdCHb.dtKueh'
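As a side note, one quick way to turn the space-separated class attribute copied from the inspector into a CSS class selector is to prefix each class with a dot (class names here are taken from the question above):

```python
# Space-separated class attribute, as copied from the browser inspector
class_attr = 'Link-sc-1khjl8b-0 styles__StyledTitleLink-mkgs8k-5 kdCHb inccCG'

# Prefix each class with a dot and join them into one CSS selector
css_selector = '.' + '.'.join(class_attr.split())
print(css_selector)  # .Link-sc-1khjl8b-0.styles__StyledTitleLink-mkgs8k-5.kdCHb.inccCG
```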
I'm trying to get some data from Walmart using Python and BeautifulSoup (bs4).
I simply wrote some code to get all the category names, and that works:
import requests
from bs4 import BeautifulSoup
baseurl = 'https://www.walmart.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
r = requests.get('https://www.walmart.com/all-departments')
soup = BeautifulSoup(r.content, 'lxml')
sub_list = soup.find_all('div', class_='alldeps-DepartmentNav-link-wrapper display-inline-block u-size-1-3')
print(sub_list)
The problem is: when I try to get the values from this link using the code below, I get empty results:
import requests
from bs4 import BeautifulSoup
baseurl = 'https://www.walmart.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
r = requests.get('https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391')
soup = BeautifulSoup(r.content, 'lxml')
general_list = soup.find_all('a', class_='product-title-link line-clamp line-clamp-2 truncate-title')
print(general_list)
From what I found in older posts, the only solution seems to be SerpApi, but that is a paid service. Is there any other way to get the values? Or am I doing something wrong?
Here is a good tutorial for Selenium:
https://selenium-python.readthedocs.io/getting-started.html#simple-usage.
I've written a short script for you to get started. All you need is to download chromedriver (Chromium) and put it on your path. On Windows, chromedriver will have a .exe extension.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.get("https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391")
assert "Walmart.com" in driver.title
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".product-title-link.line-clamp.line-clamp-2.truncate-title>span")))
elems = driver.find_elements_by_css_selector(".product-title-link.line-clamp.line-clamp-2.truncate-title>span")
for el in elems:
    print(el.text)
driver.close()
My output:
Lance Sandwich Cookies, Nekot Lemon Creme, 8 Ct Box
Nature Valley Biscuits, Almond Butter Breakfast Biscuits w/ Nut Filling, 13.5 oz
Pepperidge Farm Soft Baked Strawberry Cheesecake Cookies, 8.6 oz. Bag
Nutter Butter Family Size Peanut Butter Sandwich Cookies, 16 oz
SnackWell's Devil's Food Cookie Cakes 6.75 oz. Box
Munk Pack Protein Cookies, Variety Pack, Vegan, Gluten Free, Dairy Free Snacks, 6 Count
Great Value Twist & Shout Chocolate Sandwich Cookies, 15.5 Oz.
CHIPS AHOY! Chewy Brownie Filled Chocolate Chip Cookies, 9.5 oz
Nutter Butter Peanut Butter Wafer Cookies, 10.5 oz
Nabisco Sweet Treats Cookie Variety Pack OREO, OREO Golden & CHIPS AHOY!, 30 Snack Packs (2 Cookies Per Pack)
Archway Cookies, Soft Dutch Cocoa, 8.75 oz
OREO Double Stuf Chocolate Sandwich Cookies, Family Size, 20 oz
OREO Chocolate Sandwich Cookies, Party Size, 25.5 oz
Fiber One Soft-Baked Cookies, Chocolate Chunk, 6.6 oz
Nature Valley Toasted Coconut Biscuits with Coconut Filling, 10 ct, 13.5 oz
Great Value Duplex Sandwich Creme Cookies Family Size, 25 Oz
Great Value Assorted Sandwich creme Cookies Family Size, 25 oz
CHIPS AHOY! Original Chocolate Chip Cookies, Family Size, 18.2 oz
Archway Cookies, Crispy Windmill, 9 oz
Nabisco Classic Mix Variety Pack, OREO Mini, CHIPS AHOY! Mini, Nutter Butter Bites, RITZ Bits Cheese, Easter Snacks, 20 Snack Packs
Mother's Original Circus Animal Cookies 11 oz
Lotus Biscoff Cookies, 8.8 Oz.
Archway Cookies, Crispy Gingersnap, 12 oz
Great Value Vanilla Creme Wafer Cookies, 8 oz
Pepperidge Farm Verona Strawberry Thumbprint Cookies, 6.75 oz. Bag
Absolutely Gluten Free Coconut Macaroons
Sheila G's Brownie Brittle GLUTEN-FREE Chocolate Chip Cookie Snack Thins, 4.5oz
CHIPS AHOY! Peanut Butter Cup Chocolate Cookies, Family Size, 14.25 oz
Great Value Lemon Sandwich Creme Cookies Family Size, 25 oz
Keebler Sandies Classic Shortbread Cookies 11.2 oz
Nabisco Cookie Variety Pack, OREO, Nutter Butter, CHIPS AHOY!, 12 Snack Packs
OREO Chocolate Sandwich Cookies, Family Size, 19.1 oz
Lu Petit Ecolier European Dark Chocolate Biscuit Cookies, 45% Cocoa, 5.3 oz
Keebler Sandies Pecan Shortbread Cookies 17.2 oz
CHIPS AHOY! Reeses Peanut Butter Cup Chocolate Chip Cookies, 9.5 oz
Fiber One Soft-Baked Cookies, Oatmeal Raisin, 6 ct, 6.6 oz
OREO Dark Chocolate Crme Chocolate Sandwich Cookies, Family Size, 17 oz
Pinwheels Pure Chocolate & Marshmallow Cookies, 12 oz
Keebler Fudge Stripes Original Cookies 17.3 oz
Pepperidge Farm Classic Collection Cookies, 13.25 oz. Box
It's because the website is dynamically rendered, so the JavaScript first needs to run before the products show up. Therefore you need something that can run the JavaScript (bs4 can't do that). Have a look at the Selenium library.
I'm trying to write some code to extract some data from transfermarkt (Link Here for the page I'm using). I'm stuck trying to print the clubs. I've figured out that I need to access h2 and then the a class in order to just get the text. The HTML code is below
<div class="table-header" id="to-349"><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018"><img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" /></a><h2><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">Barnsley FC</a></h2></div>
So you can see that if I just try find_all("a", {"class": "vereinprofil_tooltip"}), it doesn't work properly, as it also returns the image link, which has no plain text. But if I could search for h2 first and then run find_all("a", {"class": "vereinprofil_tooltip"}) within the returned h2, it would get me what I want. My code is below.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
Club = Clubs.find("a", {"class": "vereinprofil_tooltip"})
print(Club)
I get the following error:
raise AttributeError(
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I know what the error means but I've been going round in circles trying to find a way of actually doing it properly and getting what I want. Any help is appreciated.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
print(type(Clubs)) # this can be removed, but I left it to expose how I figured this out
for club in Clubs:
    print(club.text)
Basically: Clubs is a list (technically a ResultSet, but the behavior is very similar), so you need to iterate over it as such. .text gives just the text; other attributes could be retrieved as well.
Output looks like:
Transfer record 18/19
Barnsley FC
Burton Albion
Sunderland AFC
Shrewsbury Town
Scunthorpe United
Charlton Athletic
Plymouth Argyle
Portsmouth FC
Peterborough United
Southend United
Bradford City
Blackpool FC
Bristol Rovers
Fleetwood Town
Doncaster Rovers
Oxford United
Gillingham FC
AFC Wimbledon
Walsall FC
Rochdale AFC
Accrington Stanley
Luton Town
Wycombe Wanderers
Coventry City
Transfer record 18/19
There are, however, a bunch of blank lines (i.e., .text was '') that you should probably handle as well.
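A minimal sketch of both points, skipping the blanks and pulling an attribute as well as the text (the HTML fragment and hrefs are made up for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<h2><a class="vereinprofil_tooltip" href="/fc-barnsley/transfers/verein/349">Barnsley FC</a></h2>
<h2>   </h2>
<h2><a class="vereinprofil_tooltip" href="/burton-albion/transfers/verein/1132">Burton Albion</a></h2>
'''
soup = BeautifulSoup(html, 'html.parser')

for h2 in soup.find_all('h2'):      # find_all returns a ResultSet: iterate like a list
    if not h2.text.strip():         # skip the blank entries
        continue
    link = h2.find('a')             # find() works on a single Tag
    print(link.text, link['href'])  # .text for the content, [...] for attributes
```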
My guess is you might mean findAll instead of find_all.
I tried this code below and it works
content = """<div class="table-header" id="to-349">
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
<img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" />
</a>
<h2>
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
Barnsley FC
</a>
</h2>
</div>"""
soup = BeautifulSoup(content, 'html.parser')
#get main_box
main_box = soup.findAll('a', {'class': 'vereinprofil_tooltip'})
#print(main_box)
for main_text in main_box:  # looping through the list
    if main_text.text.strip():  # get the body text
        print(main_text.text.strip())  # print it
output is
Barnsley FC
I'll edit this with a reference to the documentation about findAll; can't remember it off the top of my head.
edit:
took a look at the documentation; turns out find_all = findAll..
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
now I feel dumb lol
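For what it's worth, the equivalence is easy to verify on a throwaway fragment (findAll is the older camelCase alias that bs4 keeps for backwards compatibility):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>a</p><p>b</p>', 'html.parser')

# Both spellings dispatch to the same method, so the results are identical
print(soup.find_all('p') == soup.findAll('p'))  # True
```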
I am writing a program to iterate through a recipe website, The Woks of Life, and extract each recipe and store it in a CSV file. I have managed to extract the links for storage purposes, but I am having trouble extracting the elements on each page. The website link is https://thewoksoflife.com/baked-white-pepper-chicken-wings/. The elements that I am trying to reach are the name, cook time, ingredients, calories, instructions, etc.
def parse_recipe(link):
    # hardcoded link for now until I get it working
    page = requests.get("https://thewoksoflife.com/baked-white-pepper-chicken-wings/")
    soup = BeautifulSoup(page.content, 'html.parser')
    for i in soup.findAll("script", {"class": "yoast-schema-graph yoast-schema-graph--main"}):
        print(i.get("name"))  # should print "Baked White Pepper Chicken Wings" but prints "None"
For reference, when I print(i), I get:
<script class="yoast-schema-graph yoast-schema-graph--main" type="application/ld+json">
{"@context":"https://schema.org","@graph":[
{"@type":"Organization","@id":"https://thewoksoflife.com/#organization","name":"The Woks of Life","url":"https://thewoksoflife.com/","sameAs":["https://www.facebook.com/thewoksoflife","https://twitter.com/thewoksoflife"],"logo":{"@type":"ImageObject","@id":"https://thewoksoflife.com/#logo","url":"https://thewoksoflife.com/wp-content/uploads/2019/05/Temporary-Logo-e1556728319201.png","width":365,"height":364,"caption":"The Woks of Life"},"image":{"@id":"https://thewoksoflife.com/#logo"}},
{"@type":"WebSite","@id":"https://thewoksoflife.com/#website","url":"https://thewoksoflife.com/","name":"The Woks of Life","description":"a culinary genealogy","publisher":{"@id":"https://thewoksoflife.com/#organization"},"potentialAction":{"@type":"SearchAction","target":"https://thewoksoflife.com/?s={search_term_string}","query-input":"required name=search_term_string"}},
{"@type":"ImageObject","@id":"https://thewoksoflife.com/baked-white-pepper-chicken-wings/#primaryimage","url":"https://thewoksoflife.com/wp-content/uploads/2019/11/white-pepper-chicken-wings-9.jpg","width":600,"height":836,"caption":"Crispy Baked White Pepper Chicken Wings, thewoksoflife.com"},
{"@type":"WebPage","@id":"https://thewoksoflife.com/baked-white-pepper-chicken-wings/#webpage","url":"https://thewoksoflife.com/baked-white-pepper-chicken-wings/","inLanguage":"en-US","name":"Baked White Pepper Chicken Wings | The Woks of Life", .................. #continues onwards
I am trying to access the "name" (as well as other, similarly inaccessible elements) located at the end of the code snippet above, but am unable to do so.
Any help would be appreciated!
The data is in JSON format, so after locating the <script> tag, you can parse it with the json module. For example:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://thewoksoflife.com/baked-white-pepper-chicken-wings/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
data = json.loads(soup.select_one('script.yoast-schema-graph.yoast-schema-graph--main').text)
# print(json.dumps(data, indent=4)) # <-- uncomment this to print all data

recipe = next((g for g in data['@graph'] if g.get('@type', '') == 'Recipe'), None)

if recipe:
    print('Name =', recipe['name'])
    print('Cook Time =', recipe['cookTime'])
    print('Ingredients =', recipe['recipeIngredient'])
    # ... etc.
Prints:
Name = Baked White Pepper Chicken Wings
Cook Time = PT40M
Ingredients = ['3 pounds whole chicken wings ((about 14 wings))', '1-2 tablespoons white pepper powder ((divided))', '2 teaspoons salt ((divided))', '1 teaspoon Sichuan peppercorn powder ((optional))', '2 teaspoons vegetable oil ((plus more for brushing))', '1/2 cup all purpose flour', '1/4 cup cornstarch']
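Since the original goal was to store each recipe in a CSV file, here is one hedged sketch of writing the parsed fields out; the field names come from the JSON keys above, but the sample dict and the recipes.csv filename are hypothetical:

```python
import csv

# Hypothetical recipe dicts, shaped like the parsed JSON-LD above
recipes = [
    {'name': 'Baked White Pepper Chicken Wings',
     'cookTime': 'PT40M',
     'recipeIngredient': ['3 pounds whole chicken wings', '2 teaspoons salt']},
]

with open('recipes.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'cookTime', 'recipeIngredient'])
    writer.writeheader()
    for r in recipes:
        # flatten the ingredient list into a single cell
        writer.writerow({**r, 'recipeIngredient': '; '.join(r['recipeIngredient'])})
```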
For some reason I am unable to extract the data from this simple HTML table.
from bs4 import BeautifulSoup
import requests
def main():
    html_doc = requests.get(
        'http://www.wolfson.cam.ac.uk/old-site/cgi/catering-menu?week=0;style=/0,vertical')
    soup = BeautifulSoup(html_doc.text, 'html.parser')
    table = soup.find('table')
    print table

if __name__ == '__main__':
    main()
I have the table, but I cannot understand the beautifulsoup documentation well enough to know how to extract the data. The data are in tr tags.
The website shows a simple HTML food menu.
I would like to output the day of the week and the menu for that day:
Monday:
Lunch: some_lunch, Supper: some_food
Tuesday:
Lunch: some_lunch, Supper: some_supper
and so on for all the days of the week. 'Formal Hall' can be ignored.
How can I iterate over the tr tags so that I can create this output?
I normally don't provide direct solutions. You should have tried some code first, and if you faced any issue, posted that here. But anyway, this is what I've written, and it should help give you a head start.
soup = BeautifulSoup(r.content)
rows = soup.findAll("tr")
for i in xrange(1, 8):
    row = rows[i]
    print row.find("th").text
    for j in xrange(0, 2):
        print rows[0].findAll("th")[j+1].text.strip(), ": ",
        td = row.findAll("td")[j]
        for p in td.findAll("p"):
            print p.text, ",",
        print
    print
Output will look something like this:
Monday
Lunch: Leek and Potato Soup, Spaghetti Bolognese with Garlic Bread, Red Pepper and Chickpea Stroganoff with Brown Rice, Chicken Goujons with Garlic Mayonnaise Dip, Vegetable Grills with Sweet Chilli Sauce, Coffee and Walnut Sponge with Custard,
Supper: Leek and Potato Soup, Breaded Haddock with Lemon and Tartare Sauce, Vegetable Samosa with Lentil Dahl, Chilli Beef Wraps, Steamed Strawberry Sponge with Custard,
Tuesday
Lunch: Tomato and Basil Soup, Pan-fried Harrisa Spiced Chicken with Roasted Vegetables, Vegetarian Spaghetti Bolognese with Garlic Bread, Jacket Potato with Various Fillings, Apple and Plum Pie with Custard,
Supper: Tomato and Basil Soup, Lamb Tagine with Fruit Couscous, Vegetable Biryani with Naan Bread, Pan-fried Turkey Escalope, Raspberry Shortbread,
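The snippet above is Python 2 (xrange, print statements). As a hedged Python 3 sketch of the same row-by-row idea, run against a minimal stand-in table (the real page's markup may differ):

```python
from bs4 import BeautifulSoup

html = '''
<table>
  <tr><th></th><th>Lunch</th><th>Supper</th></tr>
  <tr><th>Monday</th><td><p>Soup</p><p>Pasta</p></td><td><p>Stew</p></td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
rows = soup.find_all('tr')

# The first row holds the meal headings; skip its empty corner cell
meals = [th.text for th in rows[0].find_all('th')[1:]]

for row in rows[1:]:
    day = row.find('th').text          # day name from the row header
    print(day)
    for meal, td in zip(meals, row.find_all('td')):
        dishes = ', '.join(p.text for p in td.find_all('p'))
        print('  {}: {}'.format(meal, dishes))
```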