I am trying to scrape some info from a page with Python and Beautiful Soup, and I can't seem to write the right path to what I need. The HTML is:
<div class="operator active" data-operator_name="Etisalat" data-operator_id="5"><div class="operator_name_etisalat"></div></div>
And I am trying to get that operator name, "Etisalat". I got this far:
def list_contries():
    select = Select(driver.find_element_by_id('international_country'))
    select.select_by_visible_text('France')
    request = requests.get("https://mobilerecharge.com/buy/mobile_recharge?country=Afghanistan&operator=Etisalat")
    content = request.content
    soup = BeautifulSoup(content, "html.parser")
    # print(soup.prettify())
    prov = soup.find("div", {"class": "operator active"})['data-operator_name']
    # prov = soup.find("div", {"class": "operator deselected"})
    print(prov)
    operator = (prov.text.strip())
But this just returns a NoneType, so something is not right. Can anyone please tell me what I am doing wrong? Thanks.
You could use a CSS selector. The CSS selector [data-operator_name] will select any tag with the attribute data-operator_name. Example with Beautiful Soup:
data = """<div class="operator active" data-operator_name="Etisalat" data-operator_id="5"><div class="operator_name_etisalat"></div></div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('[data-operator_name]')['data-operator_name'])
This will print:
Etisalat
EDIT:
To select multiple tags with the attribute "data-operator_name", use the .select() method:
data = """<div class="operator active" data-operator_name="Etisalat" data-operator_id="5"><div class="operator_name_etisalat"></div></div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for tag in soup.select('[data-operator_name]'):
    print(tag['data-operator_name'])
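If other tags on the page also carry data-operator_name, the selector can be narrowed by tag name and class. A minimal sketch (the extra Roshan div is made up for illustration):

```python
from bs4 import BeautifulSoup

# Markup based on the snippet from the question; the second and third divs are made up.
data = """
<div class="operator active" data-operator_name="Etisalat" data-operator_id="5"></div>
<div class="operator" data-operator_name="Roshan" data-operator_id="3"></div>
<div class="other">no operator here</div>
"""

soup = BeautifulSoup(data, 'html.parser')

# Only <div> elements with class "operator" AND the attribute match.
names = [tag['data-operator_name']
         for tag in soup.select('div.operator[data-operator_name]')]
print(names)  # ['Etisalat', 'Roshan']
```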
Somehow, when I access the link from the browser, I am not able to see the field you are after unless I inspect the element. Hence, I have used Selenium in my answer.
from bs4 import BeautifulSoup
from selenium import webdriver
scrapeLink = 'https://mobilerecharge.com/buy/mobile_recharge?country=Afghanistan&operator=Etisalat'
driver = webdriver.Firefox(executable_path=r'C:\geckodriver.exe')
driver.get(scrapeLink)
html = driver.execute_script('return document.body.innerHTML')
driver.close()
soup = BeautifulSoup(html,'html.parser')
operators = soup.find_all('div', class_='operator')
for tag in operators:
    print(tag.get('data-operator_name'))
Output:
Roshan
Etisalat
MTN
Wireless
I am trying to use Beautiful Soup to find all a tags that have an aria-label attribute (not trying to find a tags with any specific value for the attribute, just every tag that has the attribute in general). My code is shown below. When I run the code, I get an error indicating that the aria-label parameter cannot be parsed. How can I do this correctly?
url = 'https://www.encodeproject.org/search/?type=Experiment&control_type!=*&status=released&perturbed=false&assay_title=TF+ChIP-seq&assay_title=Histone+ChIP-seq&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&biosample_ontology.term_name=K562&biosample_ontology.term_name=HEK293&biosample_ontology.term_name=MCF-7&biosample_ontology.term_name=HepG2&limit=all'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
a_tags = soup.findAll('a', aria-label=True)
for tag in a_tags:
    print(tag.text.strip())
You can use the CSS selector a[aria-label], which will select all a tags that have the attribute aria-label.
To use a CSS selector, use select() instead of find_all():
import requests
from bs4 import BeautifulSoup
url = 'https://www.encodeproject.org/search/?type=Experiment&control_type!=*&status=released&perturbed=false&assay_title=TF+ChIP-seq&assay_title=Histone+ChIP-seq&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&biosample_ontology.term_name=K562&biosample_ontology.term_name=HEK293&biosample_ontology.term_name=MCF-7&biosample_ontology.term_name=HepG2&limit=all'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
a_tags = soup.select('a[aria-label]')
for tag in a_tags:
    print(tag.text.strip())
Or use the attrs= argument:
a_tags = soup.findAll('a', attrs={"aria-label": True})
Or check whether "aria-label" is in the tag's .attrs:
a_tags = soup.findAll(lambda tag: tag.name == "a" and "aria-label" in tag.attrs)
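Whichever form you use, the attribute's value is then available by indexing the tag. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = """
<a href="/a" aria-label="Experiment one">one</a>
<a href="/b">no label</a>
<a href="/c" aria-label="Experiment two">two</a>
"""

soup = BeautifulSoup(html, "html.parser")
# Select only <a> tags that carry the attribute, then read its value.
labels = [tag["aria-label"] for tag in soup.select("a[aria-label]")]
print(labels)  # ['Experiment one', 'Experiment two']
```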
I'm trying to scrape from the BTC_USDT page from Binance.
From the resulting html I couldn't find the tags that I am looking for.
For example, the live price is under <div class="subPrice css-4lmq3e">; it is found when inspected but not when scraped. Although I could find some other tags like id="__APP".
Here is my code:
import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver
url = "https://www.binance.com/en/trade/BTC_USDT?type=spot"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, "lxml")
div = soup.find("div", {"class" : "subPrice css-4lmq3e"})
content = f'Content : {str(div)}'
print(content)
You can get the current price of BTC_USDT from Binance using their API:
https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT
https://github.com/binance/binance-spot-api-docs/blob/master/rest-api.md
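As a sketch, the ticker endpoint returns a small JSON object with the documented {"symbol": ..., "price": ...} shape. The parsing below runs on a canned sample payload so it works offline; the live call is shown in a comment:

```python
import json

# A live call would be:
#   import requests
#   payload = requests.get(
#       "https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT").json()
# The sample below mirrors the endpoint's documented response shape.
sample = '{"symbol": "BTCUSDT", "price": "26543.12000000"}'

payload = json.loads(sample)
price = float(payload["price"])  # prices arrive as strings; convert to float
print(payload["symbol"], price)
```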
I'm looking to extract all of the brands from this page using Beautiful Soup. My program so far is:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
def main():
    opts = Options()
    opts.headless = True
    assert opts.headless  # Operating in headless mode
    browser = Firefox(options=opts)
    browser.get('https://neighborhoodgoods.com/pages/brands')
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    brand = []
    for tag in soup.find('table'):
        brand.append(tag.contents.text)
    print(brand)
    browser.close()
    print('This program is terminated.')
I'm struggling with figuring out the right tag to use as the data is nested in tr/td. Any advice? Thanks so much!
If I understand your question correctly, you only want to get the company name (the first <td> of each row).
Try using the CSS selector td:nth-of-type(1), which selects every <td> that is the first <td> within its parent <tr>.
import requests
from bs4 import BeautifulSoup
URL = "https://neighborhoodgoods.com/pages/brands"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
print([tag.text for tag in soup.select("td:nth-of-type(1)")])
Output:
['A.N Other', 'Act + Acre', ..., 'Wild One']
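Note that :nth-of-type is evaluated relative to each element's parent, so td:nth-of-type(1) matches the first cell of every row, not just the first cell of the table. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up two-column table for illustration.
html = """
<table>
  <tr><td>A.N Other</td><td>Home</td></tr>
  <tr><td>Act + Acre</td><td>Beauty</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# First <td> of each <tr>: one brand per row.
brands = [td.text for td in soup.select("td:nth-of-type(1)")]
print(brands)  # ['A.N Other', 'Act + Acre']
```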
I am trying to web scrape from Zalora for 3 things:
1. item brand
2. item name
3. item price(old)
Below is my initial attempt:
from bs4 import BeautifulSoup
import requests
def make_soup(url):
    html = requests.get(url)
    bsObj = BeautifulSoup(html.text, 'html.parser')
    return bsObj
soup = make_soup('https://www.zalora.com.hk/men/clothing/shirt/?gender=men&dir=desc&sort=popularity&category_id=31&enable_visual_sort=1')
itemBrand = soup.find("span",{"class":"b-catalogList__itmBrand fsm txtDark uc js-catalogProductTitle"})
itemName = soup.find("em",{"class":"b-catalogList__itmTitle fss"})
itemPrice = soup.find("span",{"class":"b-catalogList__itmPrice old"})
print(itemBrand, itemName, itemPrice)
Output:
None None None
Then I do further investigation:
productsCatalog = soup.find("ul",{"id":"productsCatalog"})
print(productsCatalog)
Output:
<ul class="b-catalogList__wrapper clearfix" id="productsCatalog">
This is the weird thing that puzzles me: there should be many tags within the ul tag (the 3 things I need are within those hidden tags). Why are they not showing up?
As a matter of fact, everything I try to scrape with BeautifulSoup within the ul tag has an output of None.
Since this content is rendered by JavaScript, you can't access it using the requests module. You should use selenium to automate your browser and then use BeautifulSoup to parse the actual html.
This is how you do it using selenium with chromedriver:
from selenium import webdriver
from bs4 import BeautifulSoup
chrome_driver = "path\\to\\chromedriver.exe"
driver = webdriver.Chrome(executable_path=chrome_driver)
target = 'https://www.zalora.com.hk/men/clothing/shirt/?gender=men&dir=desc&sort=popularity&category_id=31&enable_visual_sort=1'
driver.get(target)
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.find("span",{"class":"b-catalogList__itmBrand fsm txtDark uc js-catalogProductTitle"}).get_text().strip())
print(soup.find("span", {'class': 'b-catalogList__itmPrice old'}).get_text().strip())
print(soup.find("em",{"class":"b-catalogList__itmTitle fss"}).get_text().strip())
Output:
JAXON
HK$ 149.00
EMBROIDERY SHORT SLEEVE SHIRT
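find() only returns the first match. To collect brand, name, and price for every product on the page, you can find_all() the three lists and zip them. The sketch below runs on made-up markup that reuses the class names from the question (they may have changed on the live site):

```python
from bs4 import BeautifulSoup

# Made-up markup reusing the class names from the question.
html = """
<ul id="productsCatalog">
  <li>
    <span class="b-catalogList__itmBrand">JAXON</span>
    <em class="b-catalogList__itmTitle">EMBROIDERY SHORT SLEEVE SHIRT</em>
    <span class="b-catalogList__itmPrice old">HK$ 149.00</span>
  </li>
  <li>
    <span class="b-catalogList__itmBrand">ZALORA</span>
    <em class="b-catalogList__itmTitle">OXFORD SHIRT</em>
    <span class="b-catalogList__itmPrice old">HK$ 199.00</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
brands = soup.find_all("span", class_="b-catalogList__itmBrand")
names = soup.find_all("em", class_="b-catalogList__itmTitle")
prices = soup.find_all("span", class_="b-catalogList__itmPrice")
# Pair up the three parallel lists, one tuple per product.
items = [(b.get_text(strip=True), n.get_text(strip=True), p.get_text(strip=True))
         for b, n, p in zip(brands, names, prices)]
print(items)
```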
I'm trying to parse this website and get information about the auto in content-box card__body with BeautifulSoup.find, but it doesn't find all the classes. I also tried webdriver.PhantomJS(), but it also showed nothing.
Here is my code:
from bs4 import BeautifulSoup
from selenium import webdriver
url='http://www.autobody.ru/catalog/10230/217881/'
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html5lib')
JTitems = soup.find("div", attrs={"class":"content-box__strong red-text card__price"})
JTitems
or
w = soup.find("div", attrs={"class":"content-box card__body"})
w
Why doesn't this approach work? What should I do to get all the information about auto? I am using Python 2.7.
Find the table where the info you need is, then find all the tr elements and loop through them to get the texts.
from bs4 import BeautifulSoup
from selenium import webdriver
url='http://www.autobody.ru/catalog/10230/217881/'
browser = webdriver.Chrome()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
price = soup.find('div', class_='price').get_text()
print(price)
for tr in soup.find('table', class_='tech').find_all('tr'):
    print(tr.get_text())
browser.quit()  # quit() ends the session and closes all windows
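One small readability tweak: tr.get_text() concatenates the cell texts with no separator, while passing a separator and strip=True keeps the columns apart. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up two-cell row for illustration.
html = "<table class='tech'><tr><td>Engine</td><td>1.6L</td></tr></table>"

soup = BeautifulSoup(html, "html.parser")
row = soup.find("table", class_="tech").find("tr")
print(row.get_text())                   # 'Engine1.6L'
print(row.get_text(" | ", strip=True))  # 'Engine | 1.6L'
```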