I am following a price tracker tutorial I found on YT. However, when i try to do "python main.py" in the terminal, it shows me this:
(venv) julia#Julias-Maccie-3 pythonProject1 % python main.py
[]
[]
(venv) julia#Julias-Maccie-3 pythonProject1 %
Where the two [] are, it is supposed to show me the price and title of the product.
Here's my code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.lookfantastic.nl/olaplex-no.3-hair-perfector-100ml/11416400.html'
headers = {"User-Agent": 'My user agent'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find_all("div", class_="productName_title")
price = soup.find_all("div", class_="productPrice_price")
converted_price = price[0:4]
print(converted_price)
print(title)
Does anyone know how to solve this?
NOTE: I did fill in my user agent. Just removed it for the purpose of this question
Check your soup and adjust the tag names you expect to find:
title = soup.find_all("h1", class_="productName_title")
price = soup.find_all("p", class_="productPrice_price")
Output:
[<p class="productPrice_price" data-product-price="price">
€22,45
</p>, <p class="productPrice_price" data-product-price="price">
€22,45
</p>]
[<h1 class="productName_title" data-product-name="title">Olaplex No.3 Hair Perfector 100ml</h1>, <h1 class="productName_title" data-product-name="title">Olaplex No.3 Hair Perfector 100ml</h1>]
Be aware that find_all() will give you a ResultSet if you like to get only first information go with find() instead
title = soup.find("h1", class_="productName_title").get_text(strip=True)
price = soup.find("p", class_="productPrice_price").get_text(strip=True)
converted_price = price[1:]
Output:
22,45
Olaplex No.3 Hair Perfector 100ml
Related
I want to make a web scraper of Amazon.
But, It looks like that everydata is None type.
I found in google and there are many peoples who make a web scraper of Amazon.
Please, give me some advice to solve this Nonetype issue.
Here is my code:
import requests
from bs4 import BeautifulSoup
amazon_dir = requests.get("https://www.amazon.es/s?k=docking+station&__mk_es_ES=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=34FO3BVVCJS4V&sprefix=docking%2Caps%2C302&ref=nb_sb_ss_ts-doa-p_1_7")
amazon_soup = BeautifulSoup(amazon_dir.text, "html.parser")
product_table = amazon_soup.find("div", {"class": "sg-col-inner"})
print(product_table)
products = product_table.find("div", {"class": "a-section"})
name = products.find("span", {"class": "a-size-base-plus"})
rating = products.find("span", {"class": "a-icon-alt"})
price = products.find("span", {"class": "a-price-whole"})
print(name, rating, price)
Thank you
Portals may check header User-Agent to send different HTML for different browsers or devices and sometimes this can make problem to find elements on page.
But usually portals check this header to block scripts/bots.
For example requests sends User-Agent: python-requests/2.26.0.
If I use header User-Agent from real browser or at least shorter version Mozilla/5.0 then code works.
There is other problem.
There is almost 70 elements <div class="sg-col-inner" ...> and table is as 3th element but find() gives only first element. You have to use find_all() and later use [2] to get 3th element.
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0',
}
url = "https://www.amazon.es/s?k=docking+station&__mk_es_ES=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=34FO3BVVCJS4V&sprefix=docking%2Caps%2C302&ref=nb_sb_ss_ts-doa-p_1_7"
response = requests.get(url, headers=headers)
print(response.text[:1000])
print('---')
amazon_soup = BeautifulSoup(response.text, "html.parser")
all_divs = amazon_soup.find_all("div", {"class": "sg-col-inner"})
print('len(all_divs):', len(all_divs))
print('---')
products = all_divs[3].find("div", {"class": "a-section"})
name = products.find("span", {"class": "a-size-base-plus"})
rating = products.find("span", {"class": "a-icon-alt"})
price = products.find("span", {"class": "a-price-whole"})
print('name:', name.text)
print('rating:', rating.text)
print('price:', price.text)
EDIT:
Version which display all products:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0',
}
url = "https://www.amazon.es/s?k=docking+station&__mk_es_ES=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=34FO3BVVCJS4V&sprefix=docking%2Caps%2C302&ref=nb_sb_ss_ts-doa-p_1_7"
response = requests.get(url, headers=headers)
#print(response.text[:1000])
#print('---')
soup = BeautifulSoup(response.text, "html.parser")
results = soup.find("div", {"class": "s-main-slot s-result-list s-search-results sg-row"})
all_products = results.find_all("div", {"class": "sg-col-inner"})
print('len(all_products):', len(all_products))
print('---')
for item in all_products:
name = item.find("span", {"class": "a-size-base-plus"})
rating = item.find("span", {"class": "a-icon-alt"})
price = item.find("span", {"class": "a-price-whole"})
if name:
print('name:', name.text)
if rating:
print('rating:', rating.text)
if price:
print('price:', price.text)
if name or rating or price:
print('---')
BTW:
From time to time portals refresh code and HTML on servers - so if you find tutorial then check how old it is. Older tutorials may not work because portals could changed something in code.
Many modern pages start using JavaScript to add elements but requests and BeautifulSoup can't run JavaScript. And this may need to use Selenium to control real web browser which can run JavaScript.
I am practicing on Beautiful Soup and am after a products price, description and item number. The first 2 are text and are easy to get. The third is an attribute of the tag data-trade-price as seen below:-
<div class="price-group display-metro has-promo-price medium ng-scope" ng-class="{'has-trade-price': ShowTrade}" data-trade-price="221043">
I am after the numbers such as 221043 which is loaded in by the page. IE - all 24 item numbers matching all 24 products
My code is:-
import requests
r = requests.get('http://www.supercheapauto.com.au/store/car-care/wash-wax-polish/1021762?page=1&pageSize=24&sort=-ProductSummaryPurchasesWeighted%2C-ProductSummaryPurchases')
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'lxml')
results = soup.find_all('div', class_='details')
for result in results:
try:
SKU = result.select_one("data-trade-price")
except AttributeError: SKU = "N/A"
DESC = result.find('div', class_='title').text.strip().upper()
PRICE = result.find('span', class_='currency').text.strip().upper()
print(SKU,'\t', DESC,'\t', PRICE)
What is the syntax to get the item number from the soup?
Sorry - I am after the syntax that can iterate through the page of 24 products and recover the 24 different item numbers. The example given was to show the part of the attribute value that I was after. I ran the given answer and it works. I am unsure of how to integrate into the code given as the variations I use do not. Any suggestions.
You can access the attribute just like a dictionary.
Ex:
from bs4 import BeautifulSoup
s = """<div class="price-group display-metro has-promo-price medium ng-scope" ng-class="{'has-trade-price': ShowTrade}" data-trade-price="221043"<\div>"""
soup = BeautifulSoup(s, "html.parser")
print( soup.find("div", class_="price-group display-metro has-promo-price medium ng-scope").attrs["data-trade-price"] )
or
print( soup.find("div", class_="price-group display-metro has-promo-price medium ng-scope")["data-trade-price"] )
Output:
221043
when i run this code it gives me an empty bracket. Im new to web scraping so i dont know what im doing wrong.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
container = soup.findAll('li', {'class': 's-result-item celwidget '})
#btw the space is also there in the html code
print(container)
results:
[]
What i tried is to grab the html code from the site, and to soup trough the li tags where all the information is stored so I can print out all the information in a for loop.
Also if someone wants to explain how to use BeautifulSoup we can always talk.
Thank you guys.
So a working code that grabs product and price would could look something like this.
import requests
from bs4 import BeautifulSoup
url = 'https://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=laptop'
r = requests.get(url, headers={'User-Agent': 'Mozilla Firefox'})
soup = BeautifulSoup(r.text, 'html.parser')
container = soup.findAll('li', {'class': 's-result-item celwidget '})
for cont in container:
h2 = cont.h2.text.strip()
# Amazon lists prices in two ways. If one fails, use the other
try:
currency = cont.find('sup', {'class': 'sx-price-currency'}).text.strip()
price = currency + cont.find('span', {'class': 'sx-price-whole'}).text.strip()
except:
price = cont.find('span', {'class': 'a-size-base a-color-base'})
print('Product: {}, Price: {}'.format(h2, price))
Let me know if that helps you further...
I want to scrape name, url and description of companies as listed on google finance. So far I am successful in getting description and url but unable to fetch the name. In the source code of myUrl, name is 024 Pharma Inc. When I see the div, the class is named 'appbar-snippet-primary'. But still the code doesn't find it. I ma new to web scraping so may be I am missing something. Please guide me in this regard.
from bs4 import BeautifulSoup
import urllib
import csv
myUrl = 'https://www.google.com/finance?q=OTCMKTS%3AEEIG'
r = urllib.urlopen(myUrl).read()
soup = BeautifulSoup(r, 'html.parser')
name_box = soup.find('div', class_='appbar-snippet-primary') # !! This div is not found
#name = name_box.text
#print name
description = soup.find('div', class_='companySummary')
desc = description.text.strip()
#print desc
website = soup.find('div', class_='item')
site = website.text
#print site
from bs4 import BeautifulSoup
import requests
myUrl = 'https://www.google.com/finance?q=OTCMKTS%3AEEIG'
r = requests.get(myUrl).content
soup = BeautifulSoup(r, 'html.parser')
name = soup.find('title').text.split(':')[0] # !! This div is not found
#print name
description = soup.find('div', class_='companySummary')
desc = description.text.strip()
#print desc
website = soup.find('div', class_='item')
site = website.text
write soup.find_all() instead of soup.find()
I am trying to extract Company Name, address, and zipcode from [www.quicktransportsolutions.com][1]. I have written the following code to scrawl the site and return the information I need.
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):
page = 1
while page <= max_pages:
url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('div', {'class': 'well well-sm'}):
title = link.string
print(link)
trade_spider(1)
After running the code, I see the information that I want, but I am confused to how to get it to print without all of the non-pertinent information.
Above the
print(link)
I thought that I could have link.string pull the Company Names, but that failed. Any suggestions?
Output:
div class="well well-sm">
<b>2 OLD BOYS TRUCKING LLC</b><br><u><span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span></u><br> <span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress"><span itemprop="streetAddress">227 E 2ND</span>
<br>
<span itemprop="addressLocality">Adrian</span>, <span itemprop="addressRegion">MO</span> <span itemprop="postalCode">64720</span></br></span><br>
Trucks: 2 Drivers: 2<br>
<abbr class="initialism" title="Unique Number to identify Companies operating commercial vehicles to transport passengers or haul cargo in interstate commerce">USDOT</abbr> 2474795 <br><span class="glyphicon glyphicon-phone"></span><b itemprop="telephone"> 417-955-0651</b>
<br><a href="/inspectionreports/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Trucking Company 2 OLD BOYS TRUCKING Inspection Reports">
Everyone,
Thanks for the help so far... I'm trying to add an extra function to my little crawler. I have written the following code:
def Crawl_State_Page(max_pages):
url = 'http://www.quicktransportsolutions.com/carrier/alabama/trucking-companies.php'
while i <= len(url):
response = requests.get(url)
soup = BeautifulSoup(response.content)
table = soup.find("table", {"class" : "table table-condensed table-striped table-hover table-bordered"})
for link in table.find_all(href=True):
print link['href']
Output:
abbeville.php
adamsville.php
addison.php
adger.php
akron.php
alabaster.php
alberta.php
albertville.php
alexander-city.php
alexandria.php
aliceville.php
alpine.php
... # goes all the way to Z I cut the output short for spacing..
What I'm trying to accomplish here is to pull all of the href with the city.php and write it to a file. .. But right now, i am stuck in an infinite loop where it keep cycling through the URL. Any tips on how to increment it? My end goal is to create another function that feeds back into my trade_spider with the www.site.com/state/city.php and then loops through all 50 dates... Something to the effect of
while i < len(states,cities):
url = "http://www.quicktransportsolutions.com/carrier" + states + cities[i] +"
And then this would loop into my trade_spider function, pulling all of the information that I needed.
But, before I get to that part, I need a bit of help getting out of my infinite loop. Any suggestions? Or foreseeable issues that I am going to run into?
I tried to create a crawler that would cycle through every link on the page, and then if it found content on the page that trade_spider could crawl, it would write it to a file... However, that was a bit out of my skill set, for now. So, i'm trying this method.
I would rely on the itemprop attributes of the different tags for each company. They are conveniently set for name, url, address etc:
import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):
page = 1
while page <= max_pages:
url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
response = requests.get(url)
soup = BeautifulSoup(response.content)
for company in soup.find_all('div', {'class': 'well well-sm'}):
link = company.find('a', itemprop='url').get('href').strip()
name = company.find('span', itemprop='name').text.strip()
address = company.find('span', itemprop='address').text.strip()
print name, link, address
print "----"
trade_spider(1)
Prints:
2 OLD BOYS TRUCKING /truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php 227 E 2ND
Adrian, MO 64720
----
HILLTOP SERVICE & EQUIPMENT /truckingcompany/missouri/hilltop-service-equipment-usdot-1047604.php ROUTE 2 BOX 453
Adrian, MO 64720
----