python xpath selector from div - python

I started playing with python and come across something that should be very simple but I cannot make it work...
I had below HTML
<h2 class="sr-only">Available Products</h2>
<div id="productlistcontainer" data-defaultpageno="1" data-descfilter="" class="columns4 columnsmobile2" data-noproductstext="No Products Found" data-defaultsortorder="rank" data-fltrselectedcurrency="GBP" data-category="Category1" data-productidstodisableshortcutbuttons="976516" data-defaultpagelength="100" data-searchtermcategory="" data-noofitemsingtmpost="25">
<ul id="navlist" class="s-productscontainer2">
What I need is to use parser.xpath to get value of data-category element.
Im trying for example:
cgy = xpath('//div["data-category"]')
What Im doing wrong ?

Try Selenium webdriver with python.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("url here")
element=driver.find_element_by_xpath("//div[#id='productlistcontainer']")
print(element.get_attribute('data-category'))
Or you can use Beautifulsoup which is python library.
from bs4 import BeautifulSoup
doc = """
<h2 class="sr-only">Available Products</h2>
<div id="productlistcontainer" data-defaultpageno="1" data-descfilter="" class="columns4 columnsmobile2" data-noproductstext="No Products Found" data-defaultsortorder="rank" data-fltrselectedcurrency="GBP" data-category="Category1" data-productidstodisableshortcutbuttons="976516" data-defaultpagelength="100" data-searchtermcategory="" data-noofitemsingtmpost="25">
<ul id="navlist" class="s-productscontainer2">
"""
soup = BeautifulSoup(doc,'html.parser')
print(soup.select_one('div#productlistcontainer')['data-category'])

Personally I use lxml html to do my parsing because it is fast and easy to work with in my opinion. I could of shorten up how the category is actually being extracted but I wanted to show you as much detail as possible so you can understand what is going on.
from lxml import html
def extract_data_category(tree):
elements = [
e
for e in tree.cssselect('div#productlistcontainer')
if e.get('data-category') is not None
]
element = elements[0]
content = element.get('data-category')
return content
response = """
<h2 class="sr-only">Available Products</h2>
<div id="productlistcontainer" data-defaultpageno="1" data-descfilter="" class="columns4 columnsmobile2" data-noproductstext="No Products Found" data-defaultsortorder="rank" data-fltrselectedcurrency="GBP" data-category="Category1" data-productidstodisableshortcutbuttons="976516" data-defaultpagelength="100" data-searchtermcategory="" data-noofitemsingtmpost="25">
<ul id="navlist" class="s-productscontainer2">
"""
tree = html.fromstring(response)
data_category = extract_data_category(tree)
print (data_category)

Related

with BeautifulSoup extract text from div in a href in loop

<div class="ELEMENT1">
<div class="ELEMENT2">
<div class="ELEMENT3">valeur1</div>
<div class="ELEMENT4">
<svg class="ELEMENT5 ">
<a href="ELEMENT6» target="ELEMENT7" class="ELEMENT8">
<div>TEXT</div
Hello to all,
My request is the following
From the following piece of code, I want to create a loop that allows me
to extract TEXT if and only if div class = ELEMENT 4 AND svg class = ELEMENT 5 (because there are other different ones)
thank you for your help
eddy
you'll need to import urllib2 or some other library that allows you to fetch a urls html structure. Then you need to import beautiful soup as well. Scrape the url and store into a variable. Then reformat the output in any way that serves your needs.
For example:
import urllib2
from bs4 import beautifulSoup
page = urlopen("the_url")
content = BeautifulSoup(page.read().decode("utf-8")) #decode data (utf-8)
filter = content.find_all("div") #finds all div elements in the body
Then you could use regexp to find the actual text inside the element.
Good luck on your assignment!

How to scrape an element from a website which belongs to more than one class using BeautifulSoup

I am trying to scrape elements from a website.
<h2 class="a b" data-test-search-result-header-title> Heading </h2>
How can I extract the value Heading from the website using BeautifulSoup?
I have tried the following codes :
Code 1 :
soup.find_all(h2,{'class':['a','b']})
Code 2:
soup.find_all(h2,class_='a b'})
Both the codes return an empty list.
How to resolve this?
Try to fix code2 to soup.find_all('h2',class_='a b')
Example:
Given are four h2 tags with its classes, soup.find_all('h2',class_='a b') get the first of them, cause it is matching the filter.
To get the text of the h2 element use .text, I have done it with
[heading.text for heading in soup.find_all('h2',class_='a b')]
cause we have to loop the find_all() result.
from bs4 import BeautifulSoup
html = """
<h2 class="a b"> Heading a and b </h2>
<h2 class="b a"> Heading b and a </h2>
<h2 class="a"> Heading a </h2>
<h2 class="b"> Heading b </h2>
"""
soup=BeautifulSoup(html,'html.parser')
[heading.text for heading in soup.find_all('h2',class_='a b')]
Output
[' Heading a and b ']
Further thoughts
You say, that it would not work for you - without providing further code/information, it is hard to help and more guessing. Let me show you what also could be a reason:
Let´s say you are scraping google results, there are a lot of options to do that, I just wanna show two approaches requests and selenium.
Requests Example
Inspected classes for h3 in browser are LC20lb DKV0Md
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.google.com/search?q=stackoverflow')
soup = BeautifulSoup(r.content, 'lxml')
headingsH3Class = soup.find_all('h3', class_='LC20lb DKV0Md')
headingsH3Only = soup.find_all('h3')
print(headingsH3Class[:2])
print(headingsH3Only[:2],'\n')
Requests Example Output
An empty list
[]
A list that show us that the inspected classes are not in the page content, we get back by requests
_
[<h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Stack Overflow</div></h3>, <h3 class="zBAuLc"><div class="BNeawe vvjwJb AP7Wnd">Stack Overflow (Website) – Wikipedia</div></h3>]
Selenium Example
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://www.google.com/search?q=stackoverflow'
browser = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'lxml')
headingsH3Class = soup.find_all('h3', class_='LC20lb DKV0Md')
headingsH3Only = soup.find_all('h3')
print(headingsH3Class[:2])
print(headingsH3Only[:2])
browser.close()
Selenium Example Output
A List with exactly the h3 with it´s both classes we searched for.
_
[<h3 class="LC20lb DKV0Md"><span>Stack Overflow - Where Developers Learn, Share, & Build ...</span></h3>, <h3 class="LC20lb DKV0Md"><span>Stack Overflow (Website) – Wikipedia</span></h3>]
A list with all h3 Elements
_
[<h3 class="LC20lb DKV0Md"><span>Stack Overflow - Where Developers Learn, Share, & Build ...</span></h3>, <h3 class="r"><a class="l" data-ved="2ahUKEwj426uv9u3tAhUPohQKHYymBMAQjBAwAXoECAcQAQ" href="https://stackoverflow.com/questions" ping="/url?sa=t&source=web&rct=j&url=https://stackoverflow.com/questions&amp;ved=2ahUKEwj426uv9u3tAhUPohQKHYymBMAQjBAwAXoECAcQAQ">Questions</a></h3>]
Conclusion
Always check the data you are scraping, cause response and inspected things in browser can be different.

Python beatuifulsoup: extract value from div class

I want to build a program that automatically gets the live price of the german index (DAX). Therefore i use a website with the price provider FXCM.
In my code i use beautifulsoup and requests as packages. The div Box where the current value is stored looks like this :
<div class="left" data-item="quoteContainer" data-bg_quotepush="133962:74:bid">
<div class="wrapper cf">
<div class="left">
<span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="quote" data-bg_quotepush_c="40">13.599,24</span>
<span class="label" data-bg_quotepush="time" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="time" data-bg_quotepush_c="41">25.12.2020</span>
<span class="label"> • </span>
<span class="label" data-item="currency"></span>
</div>
<div class="right">
<span class="percent up" data-bg_quotepush="percent" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="percent" data-bg_quotepush_c="42">+0,00<span>%</span></span>
<span class="label up" data-bg_quotepush="change" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="change" data-bg_quotepush_c="43">0,00</span>
</div>
</div>
</div>
The value i want to have is the one after data-bg_quotepush_c="40" and has a vaulue of 13.599,24.
My Python code looks like this:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
response = rq.get(url)
soup = bs(response.text, "lxml")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price["data-bg_quotepush_c"])
It returns the following error:
File "C:\Users\Felix\anaconda3\lib\site-packages\bs4\element.py", line 1406, in __getitem__
return self.attrs[key]
KeyError: 'data-bg_quotepush_c'
Use Selenium instead of requests if working with dynamically generated content
What is going on?
Requesting the website with requests just provide the initial content, that not contains all the dynamically generatet information, so you can not find what your looking for.
To wait until website loaded completely use Selenium and sleep() as simple method or selenium waits in advanced.
Avoiding the error
Use price.text to get the text of the element that looks like this:
<span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_c="40" data-bg_quotepush_f="quote" data-bg_quotepush_i="133962:74:bid">13.599,24</span>
Example
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get(url)
driver.implicitly_wait(3)
soup = BeautifulSoup(driver.page_source,"html5lib")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price.text)
driver.close()
Output
13.599,24
if you scraping the value of div class try this, example
driver = webdriver.Chrome(YourPATH to driver)
from bs4 import BeautifulSoup
# create variable to store a url strings
url = 'https://news.guidants.com/#Ticker/Profil/?i=133962&e=74'
driver.get(url)
# scraping proccess
soup = BeautifulSoup(driver.page_source,"html5lib")
# parse
prices = soup.find_all("div", attrs={"class":"left"})
for price in prices:
total_price = price.find('span')
# close the driver
driver.close()
if you using requests module try use different parser
you can install with pip example html5lib
pip install html5lib
thanks

Scrapy: how to get links to users?

I'm trying to get links to group members:
response.css('.text--ellipsisOneLine::attr(href)').getall()
Why isn't this working?
html:
<div class="flex flex--row flex--noGutters flex--alignCenter">
<div class="flex-item _memberItem-module_name__BSx8i">
<a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
</a>
</div>
</div>
Your selector isn't working because you are looking for a attribute (href) that this element doesn't have.
response.css('.text--ellipsisOneLine::attr(href)').getall()
This selector is searching for href inside elements of class text--ellipsisOneLine. In your HTML snippet that class matches only with this:
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
As you can see, there is no href attribute. Now, if you want the text between this h4 element you need to use ::text pseudo-element.
response.css('.text--ellipsisOneLine::text').getall()
Read more here.
I realize that this isn't scrapy, but personally for web scraping I use the requests module and BeautifulSoup4, and the following code snippet will get you a list of users with the aforementioned modules:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.meetup.com/ru-RU/Connect-IT-Meetup-in-Chisinau/members/')
if response.status_code == 200:
html_doc = response.text
html_source = BeautifulSoup(html_doc, 'html.parser')
users = html_source.findAll('h4')
for user in users:
print(user.text)
css:
response.css('.member-item .flex--alignCenter a::attr(href)').getall()

I am struggling with python-html . I know the class of a certain header. I need the info from the generic <a href... in this h1

So, I have this:
<h1 class='entry-title'>
<a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
</h1>
How can I retrieve the URL (it is not always the same) and the title (also not always the same)?
Parse it with an HTML parser, e.g. with BeautifulSoup it would be:
from bs4 import BeautifulSoup
data = "your HTML here" # data can be the result of urllib2.urlopen(url)
soup = BeautifulSoup(data)
link = soup.select("h1.entry-title > a")[0]
print link.get("href")
print link.get_text()
where h1.entry-title > a is a CSS selector matching an a element directly under h1 element with class="entry-title".
Well, just working with strings, you can
>>> s = '''<h1 class='entry-title'>
... <a href='http://theurlthatvariesinlengthbasedonwhenirequesthehtml'>theTitleIneedthatvariesinlength</a>
... </h1>'''
>>> s.split('>')[1].strip().split('=')[1].strip("'")
'http://theurlthatvariesinlengthbasedonwhenirequesthehtml'
>>> s.split('>')[2][:-3]
'theTitleIneedthatvariesinlength'
There are other (and better) options for parsing HTML though.

Categories