I'm trying to get some data from Walmart using Python and BeautifulSoup (bs4).
I wrote a simple script to get all the category names, and it works:
import requests
from bs4 import BeautifulSoup
baseurl = 'https://www.walmart.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
r = requests.get('https://www.walmart.com/all-departments', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
sub_list = soup.find_all('div', class_='alldeps-DepartmentNav-link-wrapper display-inline-block u-size-1-3')
print(sub_list)
The problem is that when I try to get the values from the link below using the code beneath it, I get empty results:
import requests
from bs4 import BeautifulSoup
baseurl = 'https://www.walmart.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
r = requests.get('https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
general_list = soup.find_all('a', class_='product-title-link line-clamp line-clamp-2 truncate-title')
print(general_list)
Searching older posts, I only see the SerpApi solution, but that is a paid service. Is there any other way to get the values, or am I doing something wrong?
Here is a good tutorial for Selenium:
https://selenium-python.readthedocs.io/getting-started.html#simple-usage.
I've written a short script for you to get started. All you need is to download chromedriver (Chromium) and put it on your PATH. On Windows, chromedriver will have a .exe extension.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.get("https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391")
assert "Walmart.com" in driver.title
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".product-title-link.line-clamp.line-clamp-2.truncate-title>span")))
elems = driver.find_elements_by_css_selector(".product-title-link.line-clamp.line-clamp-2.truncate-title>span")
for el in elems:
    print(el.text)
driver.close()
My output:
Lance Sandwich Cookies, Nekot Lemon Creme, 8 Ct Box
Nature Valley Biscuits, Almond Butter Breakfast Biscuits w/ Nut Filling, 13.5 oz
Pepperidge Farm Soft Baked Strawberry Cheesecake Cookies, 8.6 oz. Bag
Nutter Butter Family Size Peanut Butter Sandwich Cookies, 16 oz
SnackWell's Devil's Food Cookie Cakes 6.75 oz. Box
Munk Pack Protein Cookies, Variety Pack, Vegan, Gluten Free, Dairy Free Snacks, 6 Count
Great Value Twist & Shout Chocolate Sandwich Cookies, 15.5 Oz.
CHIPS AHOY! Chewy Brownie Filled Chocolate Chip Cookies, 9.5 oz
Nutter Butter Peanut Butter Wafer Cookies, 10.5 oz
Nabisco Sweet Treats Cookie Variety Pack OREO, OREO Golden & CHIPS AHOY!, 30 Snack Packs (2 Cookies Per Pack)
Archway Cookies, Soft Dutch Cocoa, 8.75 oz
OREO Double Stuf Chocolate Sandwich Cookies, Family Size, 20 oz
OREO Chocolate Sandwich Cookies, Party Size, 25.5 oz
Fiber One Soft-Baked Cookies, Chocolate Chunk, 6.6 oz
Nature Valley Toasted Coconut Biscuits with Coconut Filling, 10 ct, 13.5 oz
Great Value Duplex Sandwich Creme Cookies Family Size, 25 Oz
Great Value Assorted Sandwich creme Cookies Family Size, 25 oz
CHIPS AHOY! Original Chocolate Chip Cookies, Family Size, 18.2 oz
Archway Cookies, Crispy Windmill, 9 oz
Nabisco Classic Mix Variety Pack, OREO Mini, CHIPS AHOY! Mini, Nutter Butter Bites, RITZ Bits Cheese, Easter Snacks, 20 Snack Packs
Mother's Original Circus Animal Cookies 11 oz
Lotus Biscoff Cookies, 8.8 Oz.
Archway Cookies, Crispy Gingersnap, 12 oz
Great Value Vanilla Creme Wafer Cookies, 8 oz
Pepperidge Farm Verona Strawberry Thumbprint Cookies, 6.75 oz. Bag
Absolutely Gluten Free Coconut Macaroons
Sheila G's Brownie Brittle GLUTEN-FREE Chocolate Chip Cookie Snack Thins, 4.5oz
CHIPS AHOY! Peanut Butter Cup Chocolate Cookies, Family Size, 14.25 oz
Great Value Lemon Sandwich Creme Cookies Family Size, 25 oz
Keebler Sandies Classic Shortbread Cookies 11.2 oz
Nabisco Cookie Variety Pack, OREO, Nutter Butter, CHIPS AHOY!, 12 Snack Packs
OREO Chocolate Sandwich Cookies, Family Size, 19.1 oz
Lu Petit Ecolier European Dark Chocolate Biscuit Cookies, 45% Cocoa, 5.3 oz
Keebler Sandies Pecan Shortbread Cookies 17.2 oz
CHIPS AHOY! Reeses Peanut Butter Cup Chocolate Chip Cookies, 9.5 oz
Fiber One Soft-Baked Cookies, Oatmeal Raisin, 6 ct, 6.6 oz
OREO Dark Chocolate Crme Chocolate Sandwich Cookies, Family Size, 17 oz
Pinwheels Pure Chocolate & Marshmallow Cookies, 12 oz
Keebler Fudge Stripes Original Cookies 17.3 oz
Pepperidge Farm Classic Collection Cookies, 13.25 oz. Box
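As a side note, if you don't want a browser window to pop up while the script runs, Chrome can also run headless. A minimal sketch, assuming a reasonably recent Chrome and the same chromedriver path as above:
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--window-size=1920,1080")  # give the page a desktop-sized viewport
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
The rest of the script stays the same.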
It's because the website is dynamically rendered, so the JavaScript first needs to run before the products show up. You therefore need something that can run the JavaScript (bs4 can't do that). Have a look at the Selenium library.
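A quick way to confirm that for yourself: fetch the page with requests and check whether the class you're targeting appears anywhere in the raw HTML. A minimal sketch using the URL from the question:
import requests

r = requests.get('https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391')
# If this prints False, the products are injected by JavaScript after page load
print('product-title-link' in r.text)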
Related
I am trying to scrape an eCommerce site using Beautiful Soup. I've noticed that the HTML of the main page comes back incomplete; in addition, my Python list comes back empty when I try to find almost any item.
Here is the code:
import requests
import bs4
res = requests.get("https://www.example.com")
soup = bs4.BeautifulSoup(res.content,"lxml")
productlist = soup.find_all('div', class_='items-IW-')
print(productlist)
The data you see on the page is loaded from an external URL via JavaScript. You can use requests to simulate that request. For example:
import json
import requests
url = "https://8r21li.a.searchspring.io/api/search/search.json"
params = {
    "resultsFormat": "native",
    "resultsPerPage": "24",
    "page": "1",
    "siteId": "8r21li",
    "bgfilter.ss_category": "Women",
}
data = requests.get(url, params=params).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for r in data["results"]:
print("{:<50} {:<10}".format(r["name"], r["price"]))
Prints:
KMDCore Polypro Long Sleeve Unisex Top 29.98
Andulo Women's Rain Jacket v3 139.98
Epiq Women's 600 Fill Hooded Down Jacket 199.98
KMDCore Polypro Unisex Long Johns 29.98
Heli Women's Hooded Lightweight Down Vest 70
Accion driMOTION Low Cut Unisex Socks – 3Pk 31.49
Pocket-it Women’s Two Layer Rain Jacket 70
Unisex Thermo Socks 2Pk 34.98
NuYarn Ergonomic Unisex Hiking Socks 34.98
KMDCore Polypro Short Sleeve Unisex Top 29.98
Fyfe Unisex Beanie 17.49
Makino Travel Skort 79.98
Miro Women’s 3/4 Pants 69.98
Ridge 100 Women’s Primaloft Bio Pullover 59.98
Epiq Women's 600 Fill Down Vest 149.98
Epiq Women's 600 Fill Down Jacket 179.98
Flinders Women’s Pants 89.98
Flight Women’s Shorts 59.98
ANY-Time Sweats Joggers 79.98
Kamana Women’s Tapered Pants 89.98
NuYarn Ergonomic Quarter Crew Unisex Hiking Socks 31.49
Bealey Women’s GORE-TEX Jacket 190
Winterburn Women’s 600 Fill Longline Down Coat 299.98
Kathmandu Badge Unisex Beanie 27.98
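If you need more than the first page, the same request can be repeated with an incremented page parameter. A sketch, assuming the API paginates the way its parameters suggest:
import requests

url = "https://8r21li.a.searchspring.io/api/search/search.json"
for page in range(1, 4):  # first three pages; adjust as needed
    params = {
        "resultsFormat": "native",
        "resultsPerPage": "24",
        "page": str(page),
        "siteId": "8r21li",
        "bgfilter.ss_category": "Women",
    }
    data = requests.get(url, params=params).json()
    for r in data["results"]:
        print(r["name"], r["price"])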
I'm trying to get chip names from this Target market link, and I want to fetch all 28 chips on the first page automatically. I wrote the code below. It opens the link, scrolls down (to load the names and pictures), and tries to collect the names:
import time
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager as CM
options = webdriver.ChromeOptions()
options.add_argument("--log-level=3")
mobile_emulation = {
    "userAgent": 'Mozilla/5.0 (Linux; Android 4.0.3; HTC One X Build/IML74K) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/83.0.1025.133 Mobile Safari/535.19'
}
options.add_experimental_option("mobileEmulation", mobile_emulation)
bot = webdriver.Chrome(executable_path=CM().install(), options=options)
bot.get('https://www.target.com/c/chips-snacks-grocery/-/N-5xsy7')
bot.set_window_size(500, 950)
time.sleep(5)
for i in range(0, 3):
    ActionChains(bot).send_keys(Keys.END).perform()
    time.sleep(1)
product_names = bot.find_elements_by_class_name('Link-sc-1khjl8b-0 styles__StyledTitleLink-mkgs8k-5 kdCHb inccCG h-display-block h-text-bold h-text-bs flex-grow-one')
hrefList = []
for e in product_names:
    hrefList.append(e.get_attribute('href'))
for href in hrefList:
    print(href)
When I inspect the names in the browser, the part common to all the chips is the class name Link-sc-1khjl8b-0 styles__StyledTitleLink-mkgs8k-5 kdCHb inccCG h-display-block h-text-bold h-text-bs flex-grow-one. So, as you can see, I added the find_elements_by_class_name('Link-sc-1khjl8b-0 styles__StyledTitleLink-mkgs8k-5 kdCHb inccCG h-display-block h-text-bold h-text-bs flex-grow-one') line. But it returns an empty result. What is wrong? Can you help me? The solution can use Selenium or bs4, it doesn't matter.
You can get all that data from the API as long as you feed in the correct key.
import requests
url = 'https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1'
payload = {
    'key': 'ff457966e64d5e877fdbad070f276d18ecec4a01',
    'category': '5xsy7',
    'channel': 'WEB',
    'count': '28',
    'default_purchasability_filter': 'true',
    'include_sponsored': 'true',
    'offset': '0',
    'page': '/c/5xsy7',
    'platform': 'desktop',
    'pricing_store_id': '1771',
    'scheduled_delivery_store_id': '1771',
    'store_ids': '1771,1768,1113,3374,1792',
    'useragent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    'visitor_id': '0179C80AE1090201B5D5C1D895ADEA6C',
}
jsonData = requests.get(url, params=payload).json()
for each in jsonData['data']['search']['products']:
    title = each['item']['product_description']['title']
    buy_url = each['item']['enrichment']['buy_url']
    image_url = each['item']['enrichment']['images']['primary_image_url']
    print(title)
Output:
Ruffles Cheddar & Sour Cream Potato Chips - 2.5oz
Doritos 3D Crunch Chili Cheese Nacho - 6oz
Hippeas Vegan White Cheddar Organic Chickpea Puffs - 5oz
PopCorners Spicy Queso - 7oz
Doritos 3D Crunch Spicy Ranch - 6oz
Pringles Snack Stacks Variety Pack Potato Crisps Chips - 12.9oz/18ct
Frito-Lay Variety Pack Flavor Mix - 18ct
Doritos Nacho Cheese Chips - 9.75oz
Hippeas Nacho Vibes Organic Chickpea Puffs - 5oz
Tostitos Scoops Tortilla Chips -10oz
Ripple Potato Chips Party Size - 13.5oz - Market Pantry™
Ritz Crisp & Thins Cream Cheese & Onion Potato And Wheat Chips - 7.1oz
Pringles Sour Cream & Onion Potato Crisps Chips - 5.5oz
Original Potato Chips Party Size - 15.25oz - Market Pantry™
Organic White Corn Tortilla Chips - 12oz - Good & Gather™
Sensible Portions Sea Salt Garden Veggie Straws - 7oz
Traditional Kettle Chips - 8oz - Good & Gather™
Lay's Classic Potato Chips - 8oz
Cheetos Crunchy Flamin Hot - 8.5oz
Sweet Potato Kettle Chips - 7oz - Good & Gather™
SunChips Harvest Cheddar Flavored Wholegrain Snacks - 7oz
Frito-Lay Variety Pack Classic Mix - 18ct
Doritos Cool Ranch Chips - 10.5oz
Lay's Wavy Original Potato Chips - 7.75oz
Frito-Lay Variety Pack Family Fun Mix - 18ct
Cheetos Jumbo Puffs - 8.5oz
Frito-Lay Fun Times Mix Variety Pack - 28ct
Doritos Nacho Cheese Flavored Tortilla Chips - 15.5oz
Lay's Barbecue Flavored Potato Chips - 7.75oz
SunChips Garden Salsa Flavored Wholegrain Snacks - 7oz
Pringles Snack Stacks Variety Pack Potato Crisps Chips - 12.9oz/18ct
Frito-Lay Variety Pack Doritos & Cheetos Mix - 18ct
This also works:
product_names = bot.find_elements_by_xpath("//li[@data-test='list-entry-product-card']")
for e in product_names:
    print(e.find_element_by_css_selector("a").get_attribute("href"))
Try instead
product_names = bot.find_elements_by_css_selector('.Link-sc-1khjl8b-0.styles__StyledTitleLink-mkgs8k-5.kdCHb.inccCG.h-display-block.h-text-bold.h-text-bs.flex-grow-one')
When using find_elements_by_class_name(), spaces in the class name are not handled properly; it accepts a single class name only, so a compound name has to be expressed as a CSS selector.
Except that selector doesn't work for me, I need to use '.Link-sc-1khjl8b-0.ItemLink-sc-1eyz3ng-0.kdCHb.dtKueh'
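To illustrate the difference, here is a generic sketch with made-up class names foo and bar (the real names will of course vary):
# find_elements_by_class_name() accepts exactly one class name:
elems = bot.find_elements_by_class_name('foo')
# Compound class names must be written as a CSS selector,
# with each class turned into a dot-prefixed segment:
elems = bot.find_elements_by_css_selector('.foo.bar')
Also note that hashed class names like kdCHb are generated by the site's build tooling and can change at any time, so a selector based on them may break without warning.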
So I'm trying to web scrape search results from Sportchek with BS4, specifically this link: "https://www.sportchek.ca/categories/men/footwear/basketball-shoes.html?page=1". I want to get the prices of the shoes there and put them all into a system to sort them, but to do that I first need the prices, and I cannot find a way to get them. In the HTML the class is product-price-text, but I can't glean anything from it. At this point, even getting the price of a single shoe would be fine. I just need help scraping anything class-related with BS4, because none of it works. I've tried
print(soup.find_all("span", class_="product-price-text"))
and even that won't work, so please help.
The data is loaded dynamically via JavaScript. You can use the requests module to load it:
import json
import requests
url = "https://www.sportchek.ca/services/sportchek/search-and-promote/products?page=1&lastVisibleProductNumber=12&x1=ast-id-level-3&q1=men%3A%3Ashoes-footwear%3A%3Abasketball&preselectedCategoriesNumber=3&preselectedBrandsNumber=0&count=24"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0",
}
data = requests.get(url, headers=headers).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for p in data["products"]:
print("{:<10} {:<10} {}".format(p["code"], p["price"], p["title"]))
Prints:
332799300 83.97 Nike Unisex KD Trey 5 VII TB Basketball Shoes - Black/White/Volt - Black
333323940 180.0 Nike Men's Air Jordan 1 Zoom Air Comfort Basketball Shoes - Black/Chile Red-white-university Gold
333107663 134.99 Nike Men's Mamba Fury Basketball Shoes - Black/Smoke Grey/White
333003748 134.99 Nike Men's Lebron Witness IV Basketball Shoes - White
333003606 104.99 Nike Men's Kyrie Flytrap III Basketball Shoes - Black/Uni Red/Bright Crimson
333003543 94.99 Nike Men's Precision III Basketball Shoes - Black/White
333107554 94.99 Nike Men's Precision IV Basketball Shoes - Black/Mtlc Gold/Dk Smoke Grey
333107404 215.0 Nike Men's LeBron XVII Low Basketball Shoes - Black/White/Multicolor
333107617 119.99 Nike Men's KD Trey 5 VIII Basketball Shoes - Black/White-aurora Green/Smoke Grey
333166326 125.98 Nike Men's KD13 Basketball Shoes - Black/White-wolf Grey
333166731 138.98 Nike Men's LeBron XVII Low Basketball Shoes - Particle Grey/White-lt Smoke Grey-black
333183810 129.99 adidas Men's D.O.N 2 Basketball Shoes - Gold/Black/Gold
333206770 111.97 Under Armour Men's Embid Basketball Shoes - Red/White
333181809 165.0 Nike Men's Air Jordan React Elevation Basketball Shoes - Black/White-lt Smoke Grey-volt
333307276 104.99 adidas Men's Harden Stepback 2 Basketball Shoes - White/Blackwhite/Black
333017256 89.99 Under Armour Men's Jet Mid Sneaker - Black/Halo Grey
332912833 134.99 Nike Men's Zoom LeBron Witness IV Running Shoes - Black/Gym Red/University Red
332799162 79.88 Under Armour Men's Curry 7 "Quiet Eye" Basketball Shoes - Black - Black
333276525 119.99 Nike Men's Kyrie Flytrap IV Basketball Shoes - Black/White-metallic Silver
333106290 145.97 Nike Men's KD13 Basketball Shoes - Black/White/Wolf Grey
333181345 144.99 Nike Men's PG 4 TB Basketball Shoes - Black/White-pure Platinum
333241817 149.99 PUMA Men's Clyde All-Pro Basketball Shoes - Puma White/Blue Atolpuma White/Blue Atol
333186052 77.97 adidas Men's Harden Stepback Basketball Shoes - Black/Gold/White
333316063 245.0 Nike Men's Air Jordan 13 Retro Basketball Shoes - White/Blackwhite/Starfish-black
EDIT: To extract the API URL from the page itself:
import re
import json
import requests
# your URL:
url = "https://www.sportchek.ca/categories/men/footwear/basketball-shoes.html?page=1"
api_url = "https://www.sportchek.ca/services/sportchek/search-and-promote/products?page=1&x1=ast-id-level-3&q1={cat}&count=24"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0",
}
html_text = requests.get(url, headers=headers).text
cat = re.search(r"br_data\.cat_id=\'(.*?)';", html_text).group(1)
data = requests.get(api_url.format(cat=cat), headers=headers).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for p in data["products"]:
print("{:<10} {:<10} {}".format(p["code"], p["price"], p["title"]))
Using Selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome('/home/cam/Downloads/chromedriver')
url = 'https://www.sportchek.ca/categories/men/footwear/basketball-shoes.html?page=1'
browser.get(url)
time.sleep(10)
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')

def get_data():
    links = soup.find_all('span', attrs={'class': "product-price-text"})
    for i in set(links):
        print(i.text)

get_data()
Output:
$245.00
$215.00
$144.99
$165.00
$129.99
$104.99
$149.99
$195.00
$180.00
$119.99
$134.99
$89.99
$94.99
$215.00
I'm using the following code to retrieve all the image links on a webpage
from bs4 import BeautifulSoup
import requests
import re

def get_txt(soup, key):
    key_tag = soup.find('span', text=re.compile(key)).parent
    return key_tag.find_all('span')[1].text

urldes = "https://www.johnpyeauctions.co.uk/lot_list.asp?saleid=4709&siteid=1"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(urldes, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
image_links = [x['data-img'] for x in soup.find_all('a', rel='popover')]
for link in image_links:
    print(link)
I would like to apply the same principle in order to retrieve the text description that goes with each image:
soup.find_all(width='41%')
for text in soup.find_all('h5'):
    print(text)
This code retrieves all the <h5> tags, but not just the specific tags inside the parent with width='41%'.
I have tried to apply the same loop as above for the image links:
image_text = [x['h5'] for x in soup.find_all(width='41%')]
for text in image_text:
    print(text)
But I get the following error:
Traceback (most recent call last):
  File "C:\Users\alexa\Desktop\jpye_v2.py", line 41, in <module>
    image_text = [x['h5'] for x in soup.find_all(width='41%')]
  File "C:\Users\alexa\Desktop\jpye_v2.py", line 41, in <listcomp>
    image_text = [x['h5'] for x in soup.find_all(width='41%')]
  File "C:\Python36\lib\site-packages\beautifulsoup4-4.6.0-py3.6.egg\bs4\element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'h5'
What I don't understand is why the h5 tag gives an error where the a tag does not. Can I not use the same loop to index the text items the same way as the image links?
First of all, the line soup.find_all(width='41%') on its own doesn't do anything. The find_all() method returns a list of all the matching tags, so you have to store that list in a variable first and then iterate over it.
For your second snippet, tag['attribute'] is used to get the value of an attribute of the tag. Using x['h5'] therefore raises a KeyError, since h5 is not an attribute but a tag.
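For instance, with the same soup, indexing works for a real attribute, while a nested tag has to be reached with find(). A quick illustration, assuming the td element from the page:
tag = soup.find('td', width='41%')
print(tag['width'])    # attribute lookup: prints 41%
print(tag.find('h5'))  # h5 is a child tag, reached via find(), not ['h5']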
Finally, to get the text that you want, you can use this:
for tag in soup.find_all('td', width='41%'):
    image_text = tag.find('h5').text
    print(image_text)
Or, to show how the find_all() method works, you can check this:
tags = soup.find_all('td', width='41%')
for tag in tags:
    image_text = tag.find('h5').text
    print(image_text)
Partial Output:
GUESS C0001G1 GENTS ROSE GOLD TONE AND BLUE BRUSH POLISHED SMART WATCH WITH VOICE COMMAND. FITTED WITH A BLUE SMOOTH SILICONE STRAP.BOXED AND PAPERS. RRP £259.00
GUESS I14503L1 LADIES SPORT WATCH WITH POLISHED SILVER COLOUR CASE WITH CRYSTALS, SILVER DIAL AND POLISHED SILVER COLOUR BRACELET. RRP £159
GUESS W0111L2 LADIES WATCH. POLISHED GOLD COLOUR CASE WITH CRYSTALS AND GOLD COLOUR MULTI-FUNCTION DIAL AND BRACELET. RRP £189
GUESS W0072L3 LADIES TREND WATCH. POLISHED ROSE GOLD CASE WITH CRYSTALS AND ROSE GOLD DIAL. POLISHED ROSE GOLD MULTI-CHAIN BRACELET WITH ADJUSTING G-LINK. RRP £159
GUESS W0330L2 LADIES SPORT WATCH. POLISHED ROSE GOLD COLOUR CASE WITH ROSE GOLD COLOUR CHRONO LOOK MULTI FUNCTION DIAL AND ROSE GOLD COLOUR BRACELET. RRP £169
GUESS W13573L1 LADIES SPORT WATCH. POLISHED GOLD COLOURED CASE WITH CRYSTAL AND WHITE MULTI FUNCTION DAIL AND POLISHED GOLD COLOURED BRACELET. RRP £169
GUESS W0674G6 MENS WATCH. ROSE GOLD CASE WITH BLACK TRIM AND SUN/ BLACK MULTI FUNCTION DIAL AND BLACK CROCODILE STYLE LEATHER BRACELET. RRP £169
GUESS W0564L1 LADIES SPORT WATCH. ROSE GOLD COLOUR CASING WITH BLUE TRIM AND CRYSTALS, WHITE MULTI FUNCTION DIAL WITH SMOOTH SILICONE STRAP. RRP £149
GUESS W0425L3 LADIES SPORT WATCH. POLISHED ROSE GOLD/ ANIMAL PRINT CASE AND SUN ROSE GOLD AND ANIMAL DAIL WITH POLISHES ROSE GOLD AND ANIMAL PRINT BRACELET. RRP £189
...
width=41% is an attribute. This will get you closer to what you want:
for text in soup.find_all('td', {'width': '41%'}):
    print(text)
For some reason I am unable to extract the table from this simple HTML page.
from bs4 import BeautifulSoup
import requests
def main():
    html_doc = requests.get(
        'http://www.wolfson.cam.ac.uk/old-site/cgi/catering-menu?week=0;style=/0,vertical')
    soup = BeautifulSoup(html_doc.text, 'html.parser')
    table = soup.find('table')
    print(table)

if __name__ == '__main__':
    main()
I have the table, but I cannot understand the beautifulsoup documentation well enough to know how to extract the data. The data is in the tr tags.
The website shows a simple HTML food menu.
I would like to output the day of the week and the menu for that day:
Monday:
Lunch: some_lunch, Supper: some_supper
Tuesday:
Lunch: some_lunch, Supper: some_supper
and so on for all the days of the week. 'Formal Hall' can be ignored.
How can I iterate over the tr tags so that I can create this output?
I normally don't provide direct solutions; you should try some code first and post here if you run into an issue. But anyway, here is what I've written, and it should give you a head start.
# r is the response for the menu URL from the question, e.g.:
# r = requests.get('http://www.wolfson.cam.ac.uk/old-site/cgi/catering-menu?week=0;style=/0,vertical')
soup = BeautifulSoup(r.content, 'html.parser')
rows = soup.findAll("tr")
for i in range(1, 8):
    row = rows[i]
    print(row.find("th").text)  # day of the week
    for j in range(0, 2):
        print(rows[0].findAll("th")[j + 1].text.strip(), ": ", end="")
        td = row.findAll("td")[j]
        for p in td.findAll("p"):
            print(p.text, ",", end=" ")
        print()
    print()
Output will look something like this:
Monday
Lunch: Leek and Potato Soup, Spaghetti Bolognese with Garlic Bread, Red Pepper and Chickpea Stroganoff with Brown Rice, Chicken Goujons with Garlic Mayonnaise Dip, Vegetable Grills with Sweet Chilli Sauce, Coffee and Walnut Sponge with Custard,
Supper: Leek and Potato Soup, Breaded Haddock with Lemon and Tartare Sauce, Vegetable Samosa with Lentil Dahl, Chilli Beef Wraps, Steamed Strawberry Sponge with Custard,
Tuesday
Lunch: Tomato and Basil Soup, Pan-fried Harrisa Spiced Chicken with Roasted Vegetables, Vegetarian Spaghetti Bolognese with Garlic Bread, Jacket Potato with Various Fillings, Apple and Plum Pie with Custard,
Supper: Tomato and Basil Soup, Lamb Tagine with Fruit Couscous, Vegetable Biryani with Naan Bread, Pan-fried Turkey Escalope, Raspberry Shortbread,