I am trying to scrape 2 elements, CDI and LONDON, from this URL: https://www.welcometothejungle.com/fr/companies/dataiku/jobs/ai-solutions-manager-life-science_london_DATAI_a2jpa5o. The issue here is that:
They have the same class
They are in the same div
For London:
<li class="sc-1qc42fc-0 kExFnG"><span role="img" class="sc-1qc42fc-3 heity"><i name="location" class="sc-kmATbt bGKMNx"></i></span><span class="sc-1qc42fc-2 jmExaK">London</span></li>
For CDI:
<li class="sc-1qc42fc-0 kExFnG"><span role="img" class="sc-1qc42fc-3 heity"><i name="contract" class="sc-kmATbt jYkMSd"></i></span><span class="sc-1qc42fc-2 jmExaK"><span>CDI</span> </span></li>
I can see that both HTML snippets have one thing that is different, the "i" tag: one's name is location, the other's is contract, but I can't seem to find a way to use this info in order to scrape the correct elements.
How can I manage to do a soup.find that will allow me to extract both elements, "CDI" and "London"?
From what I understand, this should work for you:
# Get all the children of the parent of the first li with that class
lis = list(soup.find_all('li', attrs={'class': 'sc-1qc42fc-0 kExFnG'})[0].parent.children)
fields = {}
for li in lis:
    # the <i> tag's name attribute ('location' or 'contract') becomes the key
    fields[li.find('i').get('name')] = li.text.strip()
print(fields)
Output:
{'contract': 'CDI', 'location': 'London'}
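If you would rather key off the name attribute on the <i> tag, as the question hints at, here is a minimal alternative sketch; it assumes the exact markup shown above:
# Alternative sketch: find each <i> icon by its name attribute, then read the
# text of the enclosing <li>. Assumes the HTML structure shown in the question.
location = soup.find('i', attrs={'name': 'location'}).find_parent('li').get_text(strip=True)
contract = soup.find('i', attrs={'name': 'contract'}).find_parent('li').get_text(strip=True)
print(contract, location)  # CDI London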
I am trying to create a data frame in Python of an ecommerce site's menu items. Below is an example of what the menu looks like. It is very nested and has several submenus within the main navigation bar.
Menu Example
Because there are several submenus I am having trouble creating a loop that will parse out the values for all the elements at each level. Here is an example of what the HTML looks like.
Full HTML Example
<ul class="c-nav-menu ''">
<li class="dropDown hoverFade topNav category-6 c-furniture ">
<a class="c-nav-category_link topnav-furniture" href="xx">
<span>Furniture</span>
</a>
<div class="c-nav-columns hoverFadeTarget c-nav-columns-furniture ">
<div class="c-nav-columns_group columns_group-size--5">
<div class="col col-0">
<div class="nav-columns-section-heading living-room-furniture">Living Room Furniture</div>
<ul>
<li>
<a href="xx2"> Sofa & Sectional Collections
I would like the final data frame to look like this:
Category 1 | Category 2 | Category 3 | Category 3 Link Path
Furniture | Living Room Furniture | Sofas & Sectionals | https://www.xx./furniture/living_room/sofa
Furniture | Living Room Furniture | Chairs | https://www.xx./furniture/living_room/chairs
Furniture | Living Room Furniture | Coffee Tables | https://www.xx./furniture/living_room/coffee_table
I have tried using both beautiful soup and selenium to parse each of the levels out but have not been successful in getting it to iterate.
For Category 1 I used the below code and it gave me my expected result.
input:
nav = soup.find_all("a", {"class" : lambda L: L and L.startswith('c-nav-category_link topnav')})
for item in nav:
    level1 = {
        'cat_1': item.find("span").text}
    print(level1)
output:
{'cat_1': 'Furniture'}
{'cat_1': 'Outdoor & Garden'}
{'cat_1': 'Rugs'}...
This method also worked for Category 2; however, when I tried to repeat it for Category 3, I was only able to pull out the first item under each submenu instead of all the items under each submenu.
input:
final_link = soup.find_all("div", {"class" : lambda L: L and L.startswith('col col-')},"li")
for final_item in final_link:
    level3 = {
        'cat_3': final_item.find("a").text.replace('\n',"")}
    print(level3)
output:
{'cat_3': 'Sofa & Sectional Collections'}
{'cat_3': 'Bedroom Collections'}
{'cat_3': 'Dining Collections'}...
This should return: Sofas & Sectionals, Sectionals, Sofas & Loveseats. But instead it skips to the next submenu. I also tried to parse using selenium and XPATH but was unsuccessful.
from selenium.webdriver.common.by import By
title = driver.find_elements(by=By.XPATH, value='//*[@id="topnav-container"]/ul/li')
for li in title:
    product_data = {'title': li.text}
    print(product_data)
Would anyone have any suggestions on how to parse this correctly?
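In case a starting point helps, below is a rough sketch of one way to walk the nesting with BeautifulSoup. It is only an outline based on the class names in the HTML example above ('topNav' and 'nav-columns-section-heading'), so it will likely need adjusting against the real page:
import pandas as pd

rows = []
# 'topNav' and 'nav-columns-section-heading' are taken from the HTML example above
for top in soup.find_all('li', class_='topNav'):
    cat_1 = top.find('span').get_text(strip=True)             # e.g. 'Furniture'
    for heading in top.find_all('div', class_='nav-columns-section-heading'):
        cat_2 = heading.get_text(strip=True)                  # e.g. 'Living Room Furniture'
        links = heading.find_next_sibling('ul')               # the <ul> that follows each heading
        if links is None:
            continue
        for a in links.find_all('a'):
            rows.append({'cat_1': cat_1,
                         'cat_2': cat_2,
                         'cat_3': a.get_text(strip=True),
                         'cat_3_link': a.get('href')})

df = pd.DataFrame(rows)
print(df)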
Preface: I am doing web-scraping on a real estate website out of curiosity.
Being a complete newbie to Python, I have been modifying code from other shared examples as a way to learn.
I stumbled upon a new challenge that I have never learned how to do this anywhere. So, I would like to ask the community for help.
What I want: I would like to extract the values "4" and "3" under the <li> elements as separate items. Please see the image I attached to this post for the excerpt of elements from the website.
What I attempted: I see that they are listed under div class="list-card-heading" so I tried card.find("div", {"class":"list-card-heading"}).find("ul").find("li") in the code below for the attribute named 'bed_bath'. But I only got the first value embedded in the HTML...
content = BeautifulSoup(response,"lxml")
deck = content.find('ul',{'class':'photo-cards photo-cards_wow photo-cards_short'})
for card in deck.contents:
    script = card.find('script',{'type': 'application/ld+json'})
    if script:
        script_json = json.loads(script.contents[0])
        self.results.append({
            'latitude': script_json['geo']['latitude'],
            'longitude': script_json['geo']['longitude'],
            'floorSize': script_json['floorSize']['value'],
            'url': script_json['url'],
            'price': card.find('div', {'class': 'list-card-price'}).text,
            'bed_bath': card.find("div", {"class":"list-card-heading"}).find("ul").find("li"),
            'address': card.find('address', {'class':'list-card-addr'}).text
        })
Result from my newbie attempt: <li>3<abbr class="list-card-label"> <!-- -->bds</abbr></li>
Please help
Image:
Elements from the website
You'll want to use a combination of the find_all function and the text attribute.
elements = card.find("div", { "class": "list-card-heading" }).find("ul").find_all("li")  # get all <li> elements in the <ul>
values = []
for element in elements:
    values.append(element.text)  # get the inner text from the <li> element
or, more concisely:
values = [element.text for element in card.find("div", { "class": "list-card-heading" }).find("ul").find_all("li")]
To get ["3", "4"] from the HTML snippet, you can do:
from bs4 import BeautifulSoup
txt = '''<ul class="list-card-details">
<li>
"4"
<abbr class="list-card-label">bds</abbr>
</li>
<li>
"3"
<abbr class="list-card-label">ba</abbr>
</li>
</ul>
'''
soup = BeautifulSoup(txt, 'html.parser')
out = [li.contents[0].strip() for li in soup.select('ul.list-card-details li')]
print(out)
Prints:
['"4"', '"3"']
Or:
out = [li.find(text=True).strip() for li in soup.select('ul.list-card-details li')]
Or:
out = [li.get_text(strip=True, separator='|').split('|')[0] for li in soup.select('ul.list-card-details li')]
I have my soup data like below.
<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>
<a href="/title/tt0137523/" title="David Fincher">
Fight Club
</a>
<a href="blablabla" title="Yet to Release">
Yet to Release
</a>
<a href="something" title="Movies">
Coming soon
</a>
I need the text data from those a tags on a condition, something like href="/title/*wildcard*".
My code looks somewhat like this:
titles = []
for a in soup.find_all("a", href=True):
    if a.text:
        titles.append(a.text.replace('\n'," "))
print(titles)
But with this condition, I get the text from all the a tags. I need only the text where the href contains "/title/".
I guess you want it like this:
from bs4 import BeautifulSoup
html = '''<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>
<a href="/title/tt0137523/" title="David Fincher">
Fight Club
</a>
<a href="blablabla" title="Yet to Release">
Yet to Release
</a>
<a href="something" title="Movies">
Coming soon
</a>
'''
soup = BeautifulSoup(html, 'html.parser')
titles = []
for a in soup.select('a[href*="/title/"]'):
    if a.text:
        titles.append(a.text.replace('\n'," "))
print(titles)
Output:
[' Pulp Fiction ', ' Fight Club ']
You can use a regular expression to search for the contents of an attribute (in this case href).
For more details please refer to this answer: https://stackoverflow.com/a/47091570/1426630
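For instance, a short sketch of that regex approach against the html string above (the pattern is only an illustration):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# keep only <a> tags whose href starts with "/title/"
titles = [a.get_text(strip=True) for a in soup.find_all('a', href=re.compile(r'^/title/'))]
print(titles)  # ['Pulp Fiction', 'Fight Club']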
1.) To get all <a> tags where the href begins with "/title/", you can use the CSS selector a[href^="/title/"].
2.) To strip all text inside the tag, you can use .get_text() with parameter strip=True
soup = BeautifulSoup(html_text, 'html.parser')
out = [a.get_text(strip=True) for a in soup.select('a[href^="/title/"]')]
print(out)
Prints:
['Pulp Fiction', 'Fight Club']
I am trying to scrape a website using Selenium. The element on the site is formatted in a way where it has 3 categories worth of information I want to split up. The following is the HTML I see when I inspect the element in my browser.
<div class="break-text ng-binding ng-scope" ng-if="category.dataType == "breakText"">
Pinson, AL
<br>
Pinson Valley
</div>
This format has:
"City", "State"
"High School"
which here are "Pinson", "AL" and "Pinson Valley" respectively. How do I differentiate these when scraping the data?
city = driver.find_elements_by_class_name('break-text')
state = driver.find_elements_by_class_name('break-text')
highschool = driver.find_elements_by_class_name('break-text')
Try something like this:
data = driver.find_elements_by_xpath('//div[@class="break-text ng-binding ng-scope"]')
for d in data:
    city = d.text.split('\n')[0].split(',')[0]
    state = d.text.split('\n')[0].split(',')[1]
    highschool = d.text.split('\n')[1]
    print(city)
    print(state.strip())
    print(highschool)
Output:
Pinson
AL
Pinson Valley
I have a scraper that looks for pricing on particular product pages. I'm only interested in the current price - whether the product is on sale or not.
I store the identifying tags like this in a JSON file:
{
    "some_ecommerce_site" : {
        "product_name" : ["span", "data-test", "product-name"],
        "breadcrumb" : ["div", "class", "breadcrumbs"],
        "sale_price" : ["span", "data-test", "sale-price"],
        "regular_price" : ["span", "data-test", "product-price"]
    }
}
And I have these functions to select the current price and clean up the price text:
def get_pricing(self, rpi, spi):
    sale_price = self.soup_object.find(spi[0], {spi[1] : spi[2]})
    regular_price = self.soup_object.find(rpi[0], {rpi[1] : rpi[2]})
    return sale_price if sale_price else regular_price

def get_text(self, obj):
    return re.sub(r'\s\s+', '', obj.text.strip()).encode('utf-8')
Which are called by:
def get_ids(self, name_of_ecommerce_site):
    with open('site_identifiers.json') as j:
        return json.load(j)[name_of_ecommerce_site]

def get_data(self):
    rpi = self.site_ids['regular_price']
    spi = self.site_ids['sale_price']
    product_price = self.get_text( self.get_pricing(rpi, spi) )
This works for all but one site so far because their pricing is formatted like so:
<div class="product-price">
<h3>
£15.00
<span class="price-standard">
£35.00
</span>
</h3>
</div>
So what product_price returns is "£15£35" instead of the desired "£15".
Is there a simple way to exclude the nested <span> which won't break for the working sites?
I thought a solution would be to get a list and select index 0, but checking the tag's contents, that won't work as it's a single item in the list:
>> print(type(regular_price))
>> <class 'bs4.element.Tag'>
>> print(regular_price.contents)
>> [u'\n', <h3>\n\n\xa325.00\n\n<span class="price-standard">\n\n\xa341.00\n</span>\n</h3>, u'\n']
I've tried creating a list out of the result's NavigableString elements then filtering out the empty strings:
filter(None, [self.get_text(unicode(x)) for x in sale_price.find_all(text=True)])
This fixes that one case, but breaks a few of the others (since they often have the currency in a different tag than the value amount) - I get back "£".
If you want to get the text of an element without the text of its child elements, you can do it like this:
from bs4 import BeautifulSoup,NavigableString
html = """
<div class="product-price">
<h3>
£15.00
<span class="price-standard">
£35.00
</span>
</h3>
</div>
"""
bs = BeautifulSoup(html,"xml")
result = bs.find("div",{"class":"product-price"})
fr = [element for element in result.h3 if isinstance(element, NavigableString)]
print(fr[0])
This question may be a duplicate of this one.
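Another option, purely as a hedged sketch and not part of the answer above: strip the nested price-standard span with extract() before taking the text, so sites without that span are left untouched (get_current_price is a hypothetical helper, not from the original post):
import re

def get_current_price(price_tag):
    # Hypothetical helper: if the tag contains a nested "price-standard" span
    # (the old price), remove it, then clean the remaining text as before.
    standard = price_tag.find('span', {'class': 'price-standard'})
    if standard:
        standard.extract()
    return re.sub(r'\s\s+', '', price_tag.text.strip())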