Having trouble indexing into recipe with BeautifulSoup - python

I am writing a program to iterate through a recipe website, the Woks of Life, and extract each recipe and store it in a CSV file. I have managed to extract the links for storage purpose, but I am having trouble extracting the elements on the page. The website link is https://thewoksoflife.com/baked-white-pepper-chicken-wings/. The elements that I am trying to reach are the name, cook time, ingredients, calories, instructions, etc.
def parse_recipe(link):
#hardcoded link for now until i get it working
page = requests.get("https://thewoksoflife.com/baked-white-pepper-chicken-wings/")
soup = BeautifulSoup(page.content, 'html.parser')
for i in soup.findAll("script", {"class": "yoast-schema-graph yoast-schema-graph--main"}):
print(i.get("name")) #should print "Baked White Pepper Chicken Wings" but prints "None"
For reference, when I print(i), I get:
<script class="yoast-schema-graph yoast-schema-graph--main" type="application/ld+json">
{"#context":"https://schema.org","#graph":
[{"#type":"Organization","#id":"https://thewoksoflife.com/#organization","name":"The Woks of
Life","url":"https://thewoksoflife.com/","sameAs":
["https://www.facebook.com/thewoksoflife","https://twitter.com/thewoksoflife"],"logo":
{"#type":"ImageObject","#id":"https://thewoksoflife.com/#logo","url":"https://thewoksoflife.com/wp-
content/uploads/2019/05/Temporary-Logo-e1556728319201.png","width":365,"height":364,"caption":"The
Woks of Life"},"image":{"#id":"https://thewoksoflife.com/#logo"}}{"#type":"WebSite","#id":"https://thewoksoflife.com/#website","url":"https://thewoksoflife.com/","name":
"The Woks of Life","description":"a culinary genealogy","publisher":
{"#id":"https://thewoksoflife.com/#organization"},"potentialAction":
{"#type":"SearchAction","target":"https://thewoksoflife.com/?s={search_term_string}","query-
input":"required name=search_term_string"}},
{"#type":"ImageObject","#id":"https://thewoksoflife.com/baked-white-pepper-chicken-
wings/#primaryimage","url":"https://thewoksoflife.com/wp-content/uploads/2019/11/white-pepper-
chicken-wings-9.jpg","width":600,"height":836,"caption":"Crispy Baked White Pepper Chicken Wings,
thewoksoflife.com"},{"#type":"WebPage","#id":"https://thewoksoflife.com/baked-white-pepper-
chicken-wings/#webpage","url":"https://thewoksoflife.com/baked-white-pepper-chicken-
wings/","inLanguage":"en-US","name":"Baked White Pepper Chicken Wings | The Woks of
Life", .................. #continues onwards
I am trying to access the "name" (as well as other similarly unaccessable elements) located at the end of the code snippet above, but am unable to do so.
Any help would be appreciated!

The data is in JSON format, so after locating the <script> tag, you can parse it with JSON module. For exemple:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://thewoksoflife.com/baked-white-pepper-chicken-wings/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
data = json.loads( soup.select_one('script.yoast-schema-graph.yoast-schema-graph--main').text )
# print(json.dumps(data, indent=4)) # <-- uncomment this to print all data
recipe = next((g for g in data['#graph'] if g.get('#type', '') == 'Recipe'), None)
if recipe:
print('Name =', recipe['name'])
print('Cook Time =', recipe['cookTime'])
print('Ingredients =', recipe['recipeIngredient'])
# ... etc.
Prints:
Name = Baked White Pepper Chicken Wings
Cook Time = PT40M
Ingredients = ['3 pounds whole chicken wings ((about 14 wings))', '1-2 tablespoons white pepper powder ((divided))', '2 teaspoons salt ((divided))', '1 teaspoon Sichuan peppercorn powder ((optional))', '2 teaspoons vegetable oil ((plus more for brushing))', '1/2 cup all purpose flour', '1/4 cup cornstarch']

Related

How to scrape text of a span sibling span?

Hello I'm trying to learn how to web scrape so I started by trying to web scrape my school menu.
Ive come into a problem were I can't get the menu items under a span class but instead get the the word within the same line of the span class "show".
here is a short amount of the html text I am trying to work with
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path=chromedriver.exe')#changed this
driver.get('https://housing.ucdavis.edu/dining/menus/dining-commons/tercero/')
results = []
content = driver.page_source
soups = BeautifulSoup(content, 'html.parser')
element=soups.findAll('span',class_ = 'collapsible-heading-status')
for span in element:
print(span.text)
I have tried to make it into span.span.text but that wouldn't return me anything so can some one give me some pointer on how to extract the info under the collapsible-heading-status class.
Yummy waffles - As mentioned they are gone, but to get your goal an approach would be to select the names via css selectors using the adjacent sibling combinator:
for e in soup.select('.collapsible-heading-status + span'):
print(e.text)
or with find_next_sibling():
for e in soup.find_all('span',class_ = 'collapsible-heading-status'):
print(e.find_next_sibling('span').text)
Example
To get the whole information for each in a structured way you could use:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://housing.ucdavis.edu/dining/menus/dining-commons/tercero/")
soup = BeautifulSoup(driver.page_source, 'html.parser')
data = []
for e in soup.select('.nutrition'):
d = {
'meal':e.find_previous('h4').text,
'title':e.find_previous('h5').text,
'name':e.find_previous('span').text,
'description': e.p.text
}
d.update({n.text:n.find_next().text.strip(': ') for n in e.select('h6')})
data.append(d)
data
Output
[{'meal': 'Breakfast',
'title': 'Fresh Inspirations',
'name': 'Vanilla Chia Seed Pudding with Blueberrries',
'description': 'Vanilla chia seed pudding with blueberries, shredded coconut, and toasted almonds',
'Serving Size': '1 serving',
'Calories': '392.93',
'Fat (g)': '36.34',
'Carbohydrates (g)': '17.91',
'Protein (g)': '4.59',
'Allergens': 'Tree Nuts/Coconut',
'Ingredients': 'Coconut milk, chia seeds, beet sugar, imitation vanilla (water, vanillin, caramel color, propylene glycol, ethyl vanillin, potassium sorbate), blueberries, shredded sweetened coconut (desiccated coconut processed with sugar, water, propylene glycol, salt, sodium metabisulfite), blanched slivered almonds'},
{'meal': 'Breakfast',
'title': 'Fresh Inspirations',
'name': 'Housemade Granola',
'description': 'Crunchy and sweet granola made with mixed nuts and old fashioned rolled oats',
'Serving Size': '1/2 cup',
'Calories': '360.18',
'Fat (g)': '17.33',
'Carbohydrates (g)': '47.13',
'Protein (g)': '8.03',
'Allergens': 'Gluten/Wheat/Dairy/Peanuts/Tree Nuts',
'Ingredients': 'Old fashioned rolled oats (per manufacturer, may contain wheat/gluten), sunflower seeds, seedless raisins, unsalted butter, pure clover honey, peanut-free mixed nuts (cashews, almonds, sunflower oil and/or cottonseed oil, pecans, hazelnuts, dried Brazil nuts, salt), light brown beet sugar, molasses'},...]

Get Text from h1 with BeautifulSoup

I was asked to get a product name from a web.
I was asked to get this text:
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
This is my BeautifulSoup code:
import requests
from bs4 import BeautifulSoup
get = requests.get('https://www.ebay.com/itm/SEIKO-5-AUTOMATIC-MENS-STEEL-VINTAGE-JAPAN-MADE-BLACK-DIAL-WATCH-RUN-ORDER-K/143420840058?epid=18032713872&_trkparms=ispr%3D1&hash=item21648c587a:g:ZzEAAOSw9MRdsI8v&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8')
soup = BeautifulSoup(get.text, 'lxml')
company = soup.select('h1.it-ttl')[0].text.strip()
print(company)
The HTML from the code is:
<h1 class="it-ttl" id="itemTitle" itemprop="name">
<span class="g-hdn">Details about
</span>
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
</h1>
Instead of the desired text, I get this:
Details about SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
How can I extract only the product name?
import requests
from bs4 import BeautifulSoup
get = requests.get('https://www.ebay.com/itm/SEIKO-5-AUTOMATIC-MENS-STEEL-VINTAGE-JAPAN-MADE-BLACK-DIAL-WATCH-RUN-ORDER-K/143420840058?epid=18032713872&_trkparms=ispr%3D1&hash=item21648c587a:g:ZzEAAOSw9MRdsI8v&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8')
soup = BeautifulSoup(get.text, 'html.parser')
company = soup.select('h1.it-ttl')[0].text.strip()
span_text = soup.select('span.g-hdn')[0].text.strip()
print(company)
print(span_text)
print(company.lstrip(span_text))
Since the span tag is nested in the h1 tag, the necessary step is to extract the span text and remove it from the h1 tag with the lstrip method.

Return empty bracket [ ] when web scraping

I try to print all the titles on nytimes.com. I used requests and beautifulsoup module. But I got empty brackets in the end. The return result is [ ]. How can I fix this problem?
import requests
from bs4 import BeautifulSoup
url = "https://www.nytimes.com/"
r = requests.get(url)
text = r.text
soup = BeautifulSoup(text, "html.parser")
title = soup.find_all("span", "balanceHeadline")
print(title)
I am assuming that you are trying to retrieve the headlines of nytimes. Doing title = soup.find_all("span", {'class':'balancedHeadline'}) will not get you your results. The <span> tag found using the element selector is often misleading. What you have to do is to look into the source code of the page and find the tags wrapped around the title.
For nytimes its a little tricky because the headlines are wrapped in the <script> tag with a lot of junk inside. Hence what you can do is to "clean" it first and deserialize the string by convertinng it into a python dictionary object.
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.nytimes.com/"
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")
scripts = soup.find_all('script')
for script in scripts:
if 'preloadedData' in script.text:
jsonStr = script.text
jsonStr = jsonStr.split('=', 1)[1].strip() # remove "window.__preloadedData = "
jsonStr = jsonStr.rsplit(';', 1)[0] # remove trailing ;
jsonStr = json.loads(jsonStr)
for key,value in jsonStr['initialState'].items():
try:
if value['promotionalHeadline'] != "":
print(value['promotionalHeadline'])
except:
continue
outputs
Jeffrey Epstein Autopsy Results Conclude He Hanged Himself
Trump and Netanyahu Put Bipartisan Support for Israel at Risk
Congresswoman Rejects Israel’s Offer of a West Bank Visit
In Tlaib’s Ancestral Village, a Grandmother Weathers a Global Political Storm
Cathay Chief’s Resignation Shows China’s Power Over Hong Kong Unrest
Trump Administration Approves Fighter Jet Sales to Taiwan
Peace Road Map for Afghanistan Will Let Taliban Negotiate Women’s Rights
Debate Flares Over Afghanistan as Trump Considers Troop Withdrawal
In El Paso, Hundreds Show Up to Mourn a Woman They Didn’t Know
Is Slavery’s Legacy in the Power Dynamics of Sports?
Listen: ‘Modern Love’ Podcast
‘The Interpreter’
If You Think Trump Is Helping Israel, You’re a Fool
First They Came for the Black Feminists
How Women Can Escape the Likability Trap
With Trump as President, the World Is Spiraling Into Chaos
To Understand Hong Kong, Don’t Think About Tiananmen
The Abrupt End of My Big-Girl Summer
From Trump Boom to Trump Gloom
What Are Trump and Netanyahu Afraid Of?
King Bibi Bows Before a Tweet
Ebola Could Be Eradicated — But Only if the World Works Together
The Online Mob Came for Me. What Happened to the Reckoning?
A German TV Star Takes On Bullies
Why Is Hollywood So Scared of Climate Change?
Solving Medical Mysteries With Your Help: Now on Netflix
title = soup.find_all("span", "balanceHeadline")
replace it with
title = soup.find_all("span", {'class':'balanceHeadline'})

Difficulty using beautifulsoup in Python to scrape web data from multiple HTML classes

I am using Beautiful Soup in Python to scrape some data from a property listings site.
I have had success in scraping the individual elements that I require but wish to use a more efficient script to pull back all the data in one command if possible.
The difficulty is that the various elements I require reside in different classes.
I have tried the following, so far.
for listing in content.findAll('h2', attrs={"class": "listing-results-attr"}):
print(listing.text)
which successfully gives the following list
15 room mansion for sale
3 bed barn conversion for sale
2 room duplex for sale
1 bed garden shed for sale
Separately, to retrieve the address details for each listing I have used the following successfully;
for address in content.findAll('a', attrs={"class": "listing-results-address"}):
print(address.text)
which gives this
22 Acacia Avenue, CityName Postcode
100 Sleepy Hollow, CityName Postcode
742 Evergreen Terrace, CityName Postcode
31 Spooner Street, CityName Postcode
And for property price I have used this...
for prop_price in content.findAll('a', attrs={"class": "listing-results-price"}):
print(prop_price.text)
which gives...
$350,000
$1,250,000
$750,000
$100,000
This is great however I need to be able to pull back all of this information in a more efficient and performant way such that all the data comes back in one pass.
At present I can do this using something like the code below:
all = content.select("a.listing-results-attr, h2.listing-results-address, a.listing-results-price")
This works somewhat but brings back too much additional HTML tags and is just not nearly as elegant or sophisticated as I require. Results as follows.
</a>, <h2 class="listing-results-attr">
15 room mansion for sale
</h2>, <a class="listing-results-address" href="redacted">22 Acacia Avenue, CityName Postcode</a>, <a class="listing-results-price" href="redacted">
$350,000
Expected results should look something like this:
15 room mansion for sale
22 Acacia Avenue, CityName Postcode
$350,000
3 bed barn conversion for sale
100 Sleepy Hollow, CityName Postcode
$1,250,000
etc
etc
I then need to be able to store the results as JSON objects for later analysis.
Thanks in advance.
Change your selectors as shown below:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')
details = ([item.text.strip() for item in soup.select(".listing-results-attr a, .listing-results-address , .text-price")])
You can view separately with, for example,
prices = details[0::3]
descriptions = details[1::3]
addresses = details[2::3]
print(prices, descriptions, addresses)
find_all() function always returns a list, strip() is remove spaces at the beginning and at the end of the string.
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')
results = soup.find("ul",{'class':"listing-results clearfix js-gtm-list"})
for li in results.find_all("li",{'class':"srp clearfix"}):
price = li.find("a",{"class":"listing-results-price text-price"}).text.strip()
address = li.find("a",{'class':"listing-results-address"}).text.strip()
description = li.find("h2",{'class':"listing-results-attr"}).find('a').text.strip()
print(description)
print(address)
print(price)
O/P:
2 bed detached bungalow for sale
Bronrhiw Fach, Caerphilly CF83
£159,950
2 bed semi-detached house for sale
Cwrt Nant Y Felin, Caerphilly CF83
£159,950
3 bed semi-detached house for sale
Pen-Y-Bryn, Caerphilly CF83
£102,950
.....

How to extract rows from simple html table?

For some reason I am unable to extract the table from this simple html table.
from bs4 import BeautifulSoup
import requests
def main():
html_doc = requests.get(
'http://www.wolfson.cam.ac.uk/old-site/cgi/catering-menu?week=0;style=/0,vertical')
soup = BeautifulSoup(html_doc.text, 'html.parser')
table = soup.find('table')
print table
if __name__ == '__main__':
main()
I have the table, but I cannot understand the beautifulsoup documentation well enough to know how to extract the data. The data are in tr tags.
The website shows a simple HTML food menu.
I would like to output the day of the week and the menu for that day:
Monday:
Lunch: some_lunch, Supper: some_food
Tuesday:
Lunch: some_lunch, Supper: some_supper
and so on for all the days of the week. 'Formal Hall' can be ignored.
How can I iterate over the tr tags so that I can create this output?
I normally don't provide direct solutions. You should've tried some code and if you face any issue then post it here. But anyways, this is what I've written and it should help in giving you a head start.
soup = BeautifulSoup(r.content)
rows = soup.findAll("tr")
for i in xrange(1,8):
row = rows[i]
print row.find("th").text
for j in xrange(0,2):
print rows[0].findAll("th")[j+1].text.strip(), ": ",
td = row.findAll("td")[j]
for p in td.findAll("p"):
print p.text, ",",
print
print
Output will look something like this:
Monday
Lunch: Leek and Potato Soup, Spaghetti Bolognese with Garlic Bread, Red Pepper and Chickpea Stroganoff with Brown Rice, Chicken Goujons with Garlic Mayonnaise Dip, Vegetable Grills with Sweet Chilli Sauce, Coffee and Walnut Sponge with Custard,
Supper: Leek and Potato Soup, Breaded Haddock with Lemon and Tartare Sauce, Vegetable Samosa with Lentil Dahl, Chilli Beef Wraps, Steamed Strawberry Sponge with Custard,
Tuesday
Lunch: Tomato and Basil Soup, Pan-fried Harrisa Spiced Chicken with Roasted Vegetables, Vegetarian Spaghetti Bolognese with Garlic Bread, Jacket Potato with Various Fillings, Apple and Plum Pie with Custard,
Supper: Tomato and Basil Soup, Lamb Tagine with Fruit Couscous, Vegetable Biryani with Naan Bread, Pan-fried Turkey Escalope, Raspberry Shortbread,

Categories