How to extract rows from a simple HTML table? - python

For some reason I am unable to extract the rows from this simple HTML table.
from bs4 import BeautifulSoup
import requests
def main():
    html_doc = requests.get(
        'http://www.wolfson.cam.ac.uk/old-site/cgi/catering-menu?week=0;style=/0,vertical')
    soup = BeautifulSoup(html_doc.text, 'html.parser')
    table = soup.find('table')
    print(table)

if __name__ == '__main__':
    main()
I have the table, but I can't make enough sense of the BeautifulSoup documentation to extract the data. The data are in tr tags.
The website shows a simple HTML food menu.
I would like to output the day of the week and the menu for that day:
Monday:
Lunch: some_lunch, Supper: some_food
Tuesday:
Lunch: some_lunch, Supper: some_supper
and so on for all the days of the week. 'Formal Hall' can be ignored.
How can I iterate over the tr tags so that I can create this output?

I normally don't provide direct solutions; you should try some code first and post here when you hit a specific issue. But anyway, here is what I've written, and it should give you a head start.
import requests
from bs4 import BeautifulSoup

r = requests.get(
    'http://www.wolfson.cam.ac.uk/old-site/cgi/catering-menu?week=0;style=/0,vertical')
soup = BeautifulSoup(r.content, 'html.parser')
rows = soup.find_all('tr')
for i in range(1, 8):
    row = rows[i]
    print(row.find('th').text)
    for j in range(2):
        print(rows[0].find_all('th')[j + 1].text.strip(), ': ', end='')
        td = row.find_all('td')[j]
        for p in td.find_all('p'):
            print(p.text, ',', end=' ')
        print()
    print()
Output will look something like this:
Monday
Lunch: Leek and Potato Soup, Spaghetti Bolognese with Garlic Bread, Red Pepper and Chickpea Stroganoff with Brown Rice, Chicken Goujons with Garlic Mayonnaise Dip, Vegetable Grills with Sweet Chilli Sauce, Coffee and Walnut Sponge with Custard,
Supper: Leek and Potato Soup, Breaded Haddock with Lemon and Tartare Sauce, Vegetable Samosa with Lentil Dahl, Chilli Beef Wraps, Steamed Strawberry Sponge with Custard,
Tuesday
Lunch: Tomato and Basil Soup, Pan-fried Harrisa Spiced Chicken with Roasted Vegetables, Vegetarian Spaghetti Bolognese with Garlic Bread, Jacket Potato with Various Fillings, Apple and Plum Pie with Custard,
Supper: Tomato and Basil Soup, Lamb Tagine with Fruit Couscous, Vegetable Biryani with Naan Bread, Pan-fried Turkey Escalope, Raspberry Shortbread,
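For reference, here is a more idiomatic Python 3 sketch of the same traversal that prints exactly the format the question asked for. It assumes the layout the answer relies on (row 0 holds the column headers, each following row holds a th day plus one td per meal, and 'Formal Hall' is the third meal column); the page is from the old Wolfson site and may no longer be served:
import requests
from bs4 import BeautifulSoup

url = 'http://www.wolfson.cam.ac.uk/old-site/cgi/catering-menu?week=0;style=/0,vertical'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

rows = soup.find_all('tr')
# assumed header row: ['Day', 'Lunch', 'Supper', 'Formal Hall']
headers = [th.get_text(strip=True) for th in rows[0].find_all('th')]

for row in rows[1:8]:  # Monday..Sunday
    day = row.find('th').get_text(strip=True)
    meals = []
    for label, td in zip(headers[1:3], row.find_all('td')[:2]):  # skip 'Formal Hall'
        dishes = ', '.join(p.get_text(strip=True) for p in td.find_all('p'))
        meals.append(f'{label}: {dishes}')
    print(f'{day}:')
    print(', '.join(meals))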

Related

How to scrape text of a span sibling span?

Hello, I'm trying to learn how to web scrape, so I started by scraping my school's menu.
I've run into a problem where I can't get the menu items under a span class; instead I get the word within the same line as the span class, "show".
Here is the code I am working with:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path='chromedriver.exe')  # changed this
driver.get('https://housing.ucdavis.edu/dining/menus/dining-commons/tercero/')
results = []
content = driver.page_source
soups = BeautifulSoup(content, 'html.parser')
elements = soups.find_all('span', class_='collapsible-heading-status')
for span in elements:
    print(span.text)
I have tried span.span.text, but that didn't return anything, so can someone give me some pointers on how to extract the info under the collapsible-heading-status class?
Yummy waffles - as mentioned, they are gone, but to achieve your goal, one approach is to select the names via CSS selectors using the adjacent sibling combinator:
for e in soup.select('.collapsible-heading-status + span'):
    print(e.text)
or with find_next_sibling():
for e in soup.find_all('span', class_='collapsible-heading-status'):
    print(e.find_next_sibling('span').text)
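Both lookups rely on the dish name living in the span immediately after the status span. A minimal, self-contained illustration (the toy markup below is an assumption that mirrors the relevant structure, not the real page):
from bs4 import BeautifulSoup

html = '''
<h5><span class="collapsible-heading-status">Show </span><span>Yummy Waffles</span></h5>
<h5><span class="collapsible-heading-status">Show </span><span>Chia Pudding</span></h5>
'''
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('.collapsible-heading-status + span'):
    print(e.text)  # Yummy Waffles, then Chia Pudding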
Example
To get all the information for each item in a structured way, you could use:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://housing.ucdavis.edu/dining/menus/dining-commons/tercero/")
soup = BeautifulSoup(driver.page_source, 'html.parser')
data = []
for e in soup.select('.nutrition'):
    d = {
        'meal': e.find_previous('h4').text,
        'title': e.find_previous('h5').text,
        'name': e.find_previous('span').text,
        'description': e.p.text
    }
    d.update({n.text: n.find_next().text.strip(': ') for n in e.select('h6')})
    data.append(d)
data
Output
[{'meal': 'Breakfast',
'title': 'Fresh Inspirations',
'name': 'Vanilla Chia Seed Pudding with Blueberrries',
'description': 'Vanilla chia seed pudding with blueberries, shredded coconut, and toasted almonds',
'Serving Size': '1 serving',
'Calories': '392.93',
'Fat (g)': '36.34',
'Carbohydrates (g)': '17.91',
'Protein (g)': '4.59',
'Allergens': 'Tree Nuts/Coconut',
'Ingredients': 'Coconut milk, chia seeds, beet sugar, imitation vanilla (water, vanillin, caramel color, propylene glycol, ethyl vanillin, potassium sorbate), blueberries, shredded sweetened coconut (desiccated coconut processed with sugar, water, propylene glycol, salt, sodium metabisulfite), blanched slivered almonds'},
{'meal': 'Breakfast',
'title': 'Fresh Inspirations',
'name': 'Housemade Granola',
'description': 'Crunchy and sweet granola made with mixed nuts and old fashioned rolled oats',
'Serving Size': '1/2 cup',
'Calories': '360.18',
'Fat (g)': '17.33',
'Carbohydrates (g)': '47.13',
'Protein (g)': '8.03',
'Allergens': 'Gluten/Wheat/Dairy/Peanuts/Tree Nuts',
'Ingredients': 'Old fashioned rolled oats (per manufacturer, may contain wheat/gluten), sunflower seeds, seedless raisins, unsalted butter, pure clover honey, peanut-free mixed nuts (cashews, almonds, sunflower oil and/or cottonseed oil, pecans, hazelnuts, dried Brazil nuts, salt), light brown beet sugar, molasses'},...]
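If you want those records in tabular form, pandas can consume the list of dicts directly. A small sketch (assumes pandas is installed; the filename is arbitrary):
import pandas as pd

df = pd.DataFrame(data)  # one row per menu item; missing nutrition keys become NaN
df.to_csv('tercero_menu.csv', index=False)
print(df[['meal', 'title', 'name', 'Calories']].head())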

Web Scraping (ecommerce) - empty list

I am trying to scrape an eCommerce site using Beautiful Soup. I've noticed that the HTML of the main page comes back incomplete, and my Python list comes back empty when I try to find almost any item.
Here is the code:
import requests
import bs4
res = requests.get("https://www.example.com")
soup = bs4.BeautifulSoup(res.content,"lxml")
productlist = soup.find_all('div', class_='items-IW-')
print(productlist)
The data you see on the page is loaded from an external URL via JavaScript. You can use requests to simulate that request. For example:
import json
import requests
url = "https://8r21li.a.searchspring.io/api/search/search.json"
params = {
    "resultsFormat": "native",
    "resultsPerPage": "24",
    "page": "1",
    "siteId": "8r21li",
    "bgfilter.ss_category": "Women",
}
data = requests.get(url, params=params).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for r in data["results"]:
    print("{:<50} {:<10}".format(r["name"], r["price"]))
Prints:
KMDCore Polypro Long Sleeve Unisex Top 29.98
Andulo Women's Rain Jacket v3 139.98
Epiq Women's 600 Fill Hooded Down Jacket 199.98
KMDCore Polypro Unisex Long Johns 29.98
Heli Women's Hooded Lightweight Down Vest 70
Accion driMOTION Low Cut Unisex Socks – 3Pk 31.49
Pocket-it Women’s Two Layer Rain Jacket 70
Unisex Thermo Socks 2Pk 34.98
NuYarn Ergonomic Unisex Hiking Socks 34.98
KMDCore Polypro Short Sleeve Unisex Top 29.98
Fyfe Unisex Beanie 17.49
Makino Travel Skort 79.98
Miro Women’s 3/4 Pants 69.98
Ridge 100 Women’s Primaloft Bio Pullover 59.98
Epiq Women's 600 Fill Down Vest 149.98
Epiq Women's 600 Fill Down Jacket 179.98
Flinders Women’s Pants 89.98
Flight Women’s Shorts 59.98
ANY-Time Sweats Joggers 79.98
Kamana Women’s Tapered Pants 89.98
NuYarn Ergonomic Quarter Crew Unisex Hiking Socks 31.49
Bealey Women’s GORE-TEX Jacket 190
Winterburn Women’s 600 Fill Longline Down Coat 299.98
Kathmandu Badge Unisex Beanie 27.98
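The same endpoint paginates via the page parameter, so you can walk pages and collect everything. A sketch that continues the snippet above (the empty-results stop condition is an assumption):
all_items = []
for page in range(1, 6):  # first five pages, as an example
    params["page"] = str(page)
    results = requests.get(url, params=params).json().get("results", [])
    if not results:  # assumed stop condition: an empty page means we ran out
        break
    all_items.extend(results)
print(len(all_items), "products collected")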

Web scraping past a "show more" button that extends the page

I'm trying to scrape data from Elle.com under a search term. I noticed that when I click the button, it sends a request that updates &page=2 in the URL. However, the following code just gets me a lot of duplicate entries. I need help finding a way to set a start point for each iteration of the loop (I think). Any ideas?
import requests
import nltk
import pandas as pd
from bs4 import BeautifulSoup as bs

def get_hits(url):
    r = requests.get(url)
    soup = bs(r.content, 'html')
    body = []
    for p in soup.find_all('p', {'class': 'body-text'}):
        sentences = nltk.sent_tokenize(p.text)
        result1 = [s for s in sentences if 'kim' in s]
        body.append(result1)
        result2 = [s for s in sentences if 'kanye' in s]
        body.append(result2)
    body = [a for a in body if a != []]
    if body == []:
        body.append("no hits")
    return body

titles = []
key_hits = []
urls = []
counter = 1
for i in range(1, 10):
    url = f'https://www.elle.com/search/?page={i}&q=kanye'
    r = requests.get(url)
    soup = bs(r.content, 'html')
    groups = soup.find_all('div', {'class': 'simple-item grid-simple-item'})
    for j in range(len(groups)):
        urls.append('https://www.elle.com' + groups[j].find('a')['href'])
        titles.append(groups[j].find('div', {'class': 'simple-item-title item-title'}).text)
        key_hits.append(get_hits('https://www.elle.com' + groups[j].find('a')['href']))
        if counter == 100:
            break
        counter += 1

data = pd.DataFrame({
    'Title': titles,
    'Body': key_hits,
    'Links': urls
})
data.head()
Let me know if there's something I don't understand that I probably should. Just a marketing researcher trying to learn powerful tools here.
To get pagination working on the site, you can use their infinite-scroll API URL (this example will print 9*42 titles):
import requests
from bs4 import BeautifulSoup
api_url = "https://www.elle.com/ajax/infiniteload/"
params = {
    "id": "search",
    "class": "CoreModels\\search\\TagQueryModel",
    "viewset": "search",
    "trackingId": "search-results",
    "trackingLabel": "kanye",
    "params": '{"input":"kanye","page_size":"42"}',
    "page": "1",
    "cachebuster": "undefined",
}
all_titles = set()
for page in range(1, 10):
    params["page"] = page
    soup = BeautifulSoup(
        requests.get(api_url, params=params).content, "html.parser"
    )
    for title in soup.select(".item-title"):
        print(title.text)
        all_titles.add(title.text)
    print()
print("Unique titles:", len(all_titles)) # <-- 9 * 42 = 378
Prints:
...
Kim Kardashian and Kanye West Respond to Those Divorce Rumors
People Are Noticing Something Fishy About Taylor Swift's Response to Kim Kardashian
Kim Kardashian Just Went on an Intense Twitter Rant Defending Kanye West
Trump Is Finally Able to Secure a Meeting With a Kim
Kim Kardashian West is Modeling Yeezy on the Street Again
Aziz Ansari's Willing to Model Kanye's Clothes
Unique titles: 378
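To feed the DataFrame from the question, the same API responses can be mined for links and bodies as well. A sketch that reuses api_url and params from above plus the poster's get_hits (the item classes come from the question's code):
import requests
import pandas as pd
from bs4 import BeautifulSoup

titles, key_hits, urls = [], [], []
for page in range(1, 10):
    params["page"] = page
    soup = BeautifulSoup(requests.get(api_url, params=params).content, "html.parser")
    for item in soup.select("div.simple-item.grid-simple-item"):
        link = "https://www.elle.com" + item.a["href"]  # hrefs in the response are relative
        titles.append(item.select_one(".item-title").get_text(strip=True))
        urls.append(link)
        key_hits.append(get_hits(link))

data = pd.DataFrame({'Title': titles, 'Body': key_hits, 'Links': urls})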
The "load more" pagination is driven by API calls that return plain HTML. Each link in that response is a relative URL, so I convert it into an absolute URL with the urljoin method, and I paginate by building the list of API URLs (api_urls).
Code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
api_urls = ["https://www.elle.com/ajax/infiniteload/?id=search&class=CoreModels%5Csearch%5CTagQueryModel&viewset=search&trackingId=search-results&trackingLabel=kanye&params=%7B%22input%22%3A%22kanye%22%2C%22page_size%22%3A%2242%22%7D&page="+str(x)+"&cachebuster=undefined" for x in range(1,4)]
Base_url = "https://www.elle.com"
for url in api_urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.content, "lxml")
    cards = soup.select("div.simple-item.grid-simple-item")
    for card in cards:
        title = card.select_one("div.simple-item-title.item-title")
        p = card.select_one("a")
        l = p['href']
        abs_link = urljoin(Base_url, l)
        print("Title:" + title.text + " Links: " + abs_link)
    print("-" * 80)
Output:
Title:Inside Kim Kardashian and Kanye West’s Current Relationship Amid Dinner Sighting Links: https://www.elle.com/culture/celebrities/a37833256/kim-kardashian-kanye-west-reconciled/
Title:Kim Kardashian And Ex Kanye West Left For SNL Together Amid Reports of Reconciliation Efforts Links: https://www.elle.com/culture/celebrities/a37919434/kim-kardashian-kanye-west-leave-for-snl-together-reconciliation/
Title:Kim Kardashian Wore a Purple Catsuit for Dinner With Kanye West Amid Reports She's Open to Reconciling Links: https://www.elle.com/culture/celebrities/a37822625/kim-kardashian-kanye-west-nobu-dinner-september-2021/
Title:How Kim Kardashian Really Feels About Kanye West Saying He ‘Wants Her Back’ Now Links: https://www.elle.com/culture/celebrities/a37463258/kim-kardashian-kanye-west-reconciliation-feelings-september-2021/
Title:Why Irina Shayk and Kanye West Called Off Their Two-Month Romance Links: https://www.elle.com/culture/celebrities/a37366860/why-irina-shayk-kanye-west-broke-up-august-2021/
Title:Kim Kardashian and Kanye West Reportedly Are ‘Working on Rebuilding’ Relationship and May Call Off Divorce Links: https://www.elle.com/culture/celebrities/a37421190/kim-kardashian-kanye-west-repairing-relationship-divorce-august-2021/
Title:What Kim Kardashian and Kanye West's ‘Donda’ Wedding Moment Really Means for Their Relationship Links: https://www.elle.com/culture/celebrities/a37415557/kim-kardashian-kanye-west-donda-wedding-moment-explained/
Title:What Kim Kardashian and Kanye West's Relationship Is Like Now: ‘The Tension Has Subsided’ Links: https://www.elle.com/culture/celebrities/a37383301/kim-kardashian-kanye-west-relationship-details-august-2021/
Title:How Kim Kardashian and Kanye West’s Relationship as Co-Parents Has Evolved Links: https://www.elle.com/culture/celebrities/a37250155/kim-kardashian-kanye-west-co-parents/
Title:Kim Kardashian Went Out in a Giant Shaggy Coat and a Black Wrap Top for Dinner in NYC Links: https://www.elle.com/culture/celebrities/a37882897/kim-kardashian-shaggy-coat-black-outfit-nyc-dinner/
Title:Kim Kardashian Wore Two Insane, Winter-Ready Outfits in One Warm NYC Day Links: https://www.elle.com/culture/celebrities/a37906750/kim-kardashian-overdressed-fall-outfits-october-2021/
Title:Kim Kardashian Dressed Like a Superhero for Justin Bieber's 2021 Met Gala After Party Links: https://www.elle.com/culture/celebrities/a37593656/kim-kardashian-superhero-outfit-met-gala-after-party-2021/
Title:Kim Kardashian Killed It In Her Debut as a Saturday Night Live Host Links: https://www.elle.com/culture/celebrities/a37918950/kim-kardashian-saturday-night-live-best-sketches/
Title:Kim Kardashian Has Been Working ‘20 Hours a Day’ For Her Appearance On SNL Links: https://www.elle.com/culture/celebrities/a37915962/kim-kardashian-saturday-night-live-preperation/
Title:Why Taylor Swift and Joe Alwyn Skipped the 2021 Met Gala Links: https://www.elle.com/culture/celebrities/a37446411/why-taylor-swift-joe-alwyn-skipped-met-gala-2021/
Title:Kim Kardashian Says North West Still Wants to Be an Only Child Five Years Into Having Siblings Links: https://www.elle.com/culture/celebrities/a37620539/kim-kardashian-north-west-only-child-comment-september-2021/
Title:How Kim Kardashian's Incognito 2021 Met Gala Glam Came Together Links: https://www.elle.com/beauty/makeup-skin-care/a37584576/kim-kardashians-incognito-2021-met-gala-beauty-breakdown/
Title:Kim Kardashian Completely Covered Her Face and Everything in a Black Balenciaga Look at the 2021 Met Gala Links: https://www.elle.com/culture/celebrities/a37578520/kim-kardashian-faceless-outfit-met-gala-2021/
Title:How Kim Kardashian Feels About Kanye West Singing About Their Divorce and ‘Losing My Family’ on Donda Album Links: https://www.elle.com/culture/celebrities/a37113130/kim-kardashian-kanye-west-divorce-song-donda-album-feelings/
Title:Kanye West Teases New Song In Beats By Dre Commercial Starring Sha'Carri Richardson Links: https://www.elle.com/culture/celebrities/a37090223/kanye-west-teases-new-song-in-beats-by-dre-commercial-starring-shacarri-richardson/
Title:Inside Kim Kardashian and Kanye West's Relationship Amid His Irina Shayk Romance Links: https://www.elle.com/culture/celebrities/a37077662/kim-kardashian-kanye-west-relationship-irina-shayk-romance-july-2021/
and ... so on

Having trouble indexing into recipe with BeautifulSoup

I am writing a program to iterate through a recipe website, the Woks of Life, and extract each recipe and store it in a CSV file. I have managed to extract the links for storage purposes, but I am having trouble extracting the elements on the page. The page is https://thewoksoflife.com/baked-white-pepper-chicken-wings/. The elements that I am trying to reach are the name, cook time, ingredients, calories, instructions, etc.
import requests
from bs4 import BeautifulSoup

def parse_recipe(link):
    # hardcoded link for now until I get it working
    page = requests.get("https://thewoksoflife.com/baked-white-pepper-chicken-wings/")
    soup = BeautifulSoup(page.content, 'html.parser')
    for i in soup.find_all("script", {"class": "yoast-schema-graph yoast-schema-graph--main"}):
        print(i.get("name"))  # should print "Baked White Pepper Chicken Wings" but prints "None"
For reference, when I print(i), I get:
<script class="yoast-schema-graph yoast-schema-graph--main" type="application/ld+json">
{"@context":"https://schema.org","@graph":[
{"@type":"Organization","@id":"https://thewoksoflife.com/#organization","name":"The Woks of Life","url":"https://thewoksoflife.com/","sameAs":["https://www.facebook.com/thewoksoflife","https://twitter.com/thewoksoflife"],"logo":{"@type":"ImageObject","@id":"https://thewoksoflife.com/#logo","url":"https://thewoksoflife.com/wp-content/uploads/2019/05/Temporary-Logo-e1556728319201.png","width":365,"height":364,"caption":"The Woks of Life"},"image":{"@id":"https://thewoksoflife.com/#logo"}},
{"@type":"WebSite","@id":"https://thewoksoflife.com/#website","url":"https://thewoksoflife.com/","name":"The Woks of Life","description":"a culinary genealogy","publisher":{"@id":"https://thewoksoflife.com/#organization"},"potentialAction":{"@type":"SearchAction","target":"https://thewoksoflife.com/?s={search_term_string}","query-input":"required name=search_term_string"}},
{"@type":"ImageObject","@id":"https://thewoksoflife.com/baked-white-pepper-chicken-wings/#primaryimage","url":"https://thewoksoflife.com/wp-content/uploads/2019/11/white-pepper-chicken-wings-9.jpg","width":600,"height":836,"caption":"Crispy Baked White Pepper Chicken Wings, thewoksoflife.com"},
{"@type":"WebPage","@id":"https://thewoksoflife.com/baked-white-pepper-chicken-wings/#webpage","url":"https://thewoksoflife.com/baked-white-pepper-chicken-wings/","inLanguage":"en-US","name":"Baked White Pepper Chicken Wings | The Woks of Life", ... (continues onwards)
I am trying to access the "name" (as well as other similarly inaccessible elements) located at the end of the code snippet above, but am unable to do so.
Any help would be appreciated!
The data is in JSON format, so after locating the <script> tag, you can parse it with the json module. For example:
import json
import requests
from bs4 import BeautifulSoup

url = 'https://thewoksoflife.com/baked-white-pepper-chicken-wings/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

data = json.loads(soup.select_one('script.yoast-schema-graph.yoast-schema-graph--main').text)
# print(json.dumps(data, indent=4))  # <-- uncomment this to print all data

recipe = next((g for g in data['@graph'] if g.get('@type', '') == 'Recipe'), None)
if recipe:
    print('Name =', recipe['name'])
    print('Cook Time =', recipe['cookTime'])
    print('Ingredients =', recipe['recipeIngredient'])
    # ... etc.
Prints:
Name = Baked White Pepper Chicken Wings
Cook Time = PT40M
Ingredients = ['3 pounds whole chicken wings ((about 14 wings))', '1-2 tablespoons white pepper powder ((divided))', '2 teaspoons salt ((divided))', '1 teaspoon Sichuan peppercorn powder ((optional))', '2 teaspoons vegetable oil ((plus more for brushing))', '1/2 cup all purpose flour', '1/4 cup cornstarch']
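Since the stated goal is a CSV file, the parsed fields can be written out with the standard library. A sketch (the column choice and filename are assumptions):
import csv

def write_recipes(recipes, path='recipes.csv'):
    # each item in `recipes` is a dict parsed from the JSON-LD, like `recipe` above
    fields = ['name', 'cookTime', 'recipeIngredient']
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for r in recipes:
            writer.writerow({
                'name': r.get('name'),
                'cookTime': r.get('cookTime'),
                'recipeIngredient': '; '.join(r.get('recipeIngredient', [])),
            })

write_recipes([recipe])  # `recipe` comes from the snippet above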

Get Text from h1 with BeautifulSoup

I was asked to get a product name from a web page.
I was asked to get this text:
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
This is my BeautifulSoup code:
import requests
from bs4 import BeautifulSoup
get = requests.get('https://www.ebay.com/itm/SEIKO-5-AUTOMATIC-MENS-STEEL-VINTAGE-JAPAN-MADE-BLACK-DIAL-WATCH-RUN-ORDER-K/143420840058?epid=18032713872&_trkparms=ispr%3D1&hash=item21648c587a:g:ZzEAAOSw9MRdsI8v&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8')
soup = BeautifulSoup(get.text, 'lxml')
company = soup.select('h1.it-ttl')[0].text.strip()
print(company)
The relevant HTML is:
<h1 class="it-ttl" id="itemTitle" itemprop="name">
<span class="g-hdn">Details about
</span>
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
</h1>
Instead of the desired text, I get this:
Details about SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
How can I extract only the product name?
import requests
from bs4 import BeautifulSoup
get = requests.get('https://www.ebay.com/itm/SEIKO-5-AUTOMATIC-MENS-STEEL-VINTAGE-JAPAN-MADE-BLACK-DIAL-WATCH-RUN-ORDER-K/143420840058?epid=18032713872&_trkparms=ispr%3D1&hash=item21648c587a:g:ZzEAAOSw9MRdsI8v&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8')
soup = BeautifulSoup(get.text, 'html.parser')
company = soup.select('h1.it-ttl')[0].text.strip()
span_text = soup.select('span.g-hdn')[0].text.strip()
print(company)
print(span_text)
print(company.replace(span_text, '', 1).strip())
Since the span tag is nested in the h1 tag, the necessary step is to extract the span text and then remove that prefix from the h1 text. Note that str.lstrip(span_text) would be unreliable here: lstrip treats its argument as a set of characters to strip, not a prefix, so it can eat into the start of the product name itself.
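An alternative that avoids string surgery altogether is to remove the span from the parse tree before reading the text. A short sketch using BeautifulSoup's extract(), continuing from the soup above:
h1 = soup.select_one('h1.it-ttl')
h1.span.extract()  # drop the nested "Details about" span from the tree
print(h1.get_text(strip=True))
# SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K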
