I'm trying to get the prices for a list of monitors from Amazon, using requests and bs4.
Here is the code:
from bs4 import BeautifulSoup
import re
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',}
res = requests.get("https://www.amazon.com/s?i=specialty-aps&bbn=16225007011&rh=n%3A16225007011%2Cn%3A1292115011&ref=nav_em__nav_desktop_sa_intl_monitors_0_2_6_8", headers=headers)
print(res)
soup = BeautifulSoup(res.text, "html.parser")
price=soup.find_all(class_="a-price-whole")
print(price.text)
I don't understand why it returns None - I'm basically following a video (https://www.youtube.com/watch?v=Bg9r_yLk7VY&t=467s&ab_channel=DevEd), and on their side it returns the text. Can someone point out what I'm doing wrong?
You've probably received a CAPTCHA page. Try adding an "Accept-Language" HTTP header:
import re
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5",
}
res = requests.get(
    "https://www.amazon.com/s?i=specialty-aps&bbn=16225007011&rh=n%3A16225007011%2Cn%3A1292115011&ref=nav_em__nav_desktop_sa_intl_monitors_0_2_6_8",
    headers=headers,
)
soup = BeautifulSoup(res.text, "html.parser")
prices = soup.find_all(class_="a-price-whole")
for price in prices:
    print(
        price.find_previous("h2").text[:30] + "...",
        price.text + price.find_next(class_="a-price-fraction").text,
    )
Prints:
Sceptre IPS 27-Inch Business C... 159.17
EVICIV 12.3’’ Raspberry Pi Tou... 199.99
Portable Monitor, 17.3'' IPS H... 349.99
Acer R240HY bidx 23.8-Inch IPS... 129.99
Dell SE2419Hx 24" IPS Full HD ... 169.95
HP Pavilion 22cwa 21.5-Inch Fu... 139.99
Sceptre E248W-19203R 24" Ultra... 127.98
LG 27GL83A-B 27 Inch Ultragear... 379.99
LG 24M47VQ 24-Inch LED-lit Mon... 99.99
LG 27UN850-W 27 Inch Ultrafine... 404.14
Sceptre IPS 24-Inch Business C... 142.17
Planar PXN2400 Full HD Thin Pr... 139.00
Sceptre IPS 24-Inch Business C... 142.17
Portable Triple Screen Laptop ... 419.99
ASUS ZenScreen 15.6" 1080P Por... 232.52
HP M27ha FHD Monitor - Full HD... 199.99
ASUS 24" 1080P Gaming Monitor ... 189.99
Dell P2419H 24 Inch LED-Backli... 187.99
LG 32QN600-B 32-Inch QHD (2560... 249.99
LG 29WN600-W 29" 21:9 UltraWid... 226.99
Acer Nitro XV272U Pbmiiprzx 27... 299.99
AOC C24G1 24" Curved Frameless... 186.99
Samsung CF390 Series 27 inch F... 199.00
ASUS VY279HE 27” Eye Care Moni... 219.00
SAMSUNG LC24F396FHNXZA 23.5" F... 149.99
Sceptre E275W-19203R 27" Ultra... 169.97
ASUS VG245H 24 inchFull HD 108... 164.95
PEPPER JOBS 15.6" USB-C Portab... 199.99
13.3 inch Portable Monitor,KEN... 96.99
Eyoyo Small Monitor 8 inch Min... 76.98
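Note that Amazon sometimes serves its robot-check page even with the extra header. A quick sanity check before parsing - a minimal sketch, where the marker string is an assumption (print res.text to see what actually came back):

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5",
}
res = requests.get(
    "https://www.amazon.com/s?i=specialty-aps&bbn=16225007011&rh=n%3A16225007011%2Cn%3A1292115011&ref=nav_em__nav_desktop_sa_intl_monitors_0_2_6_8",
    headers=headers,
)

# "Enter the characters you see below" appears on Amazon's robot-check page;
# treat the exact wording as an assumption and inspect res.text if unsure
if "Enter the characters you see below" in res.text:
    print("Got a CAPTCHA page - retry later or adjust headers")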
I'm a beginner at web scraping with Python, but I need to use it frequently.
I'm trying to scrape an e-shop for mobiles to get each item's name and price.
Website: https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false
My code, using the "User-Agent" technique, is below:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}
web_page = requests.get(url,headers=headers)
soup = BeautifulSoup(web_page.content, "html.parser")
product_list = soup.find_all('div', class_='col-md-6 col-lg-4 mb-4')
product_list
Output: [] -> an empty list.
I'm not sure I'm doing this right; also, when I look at the page's source code, I can't find this information there.
That page is loaded initially, then further hydrated from an API (which returns HTML).
This is one way to get those products sold by Orange Egypt:
from bs4 import BeautifulSoup as bs
import requests
from tqdm import tqdm ## if using jupyter notebook, import as: from tqdm.notebook import tqdm
import pandas as pd
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
s = requests.Session()
s.headers.update(headers)
big_list = []
for x in tqdm(range(1, 16)):
    url = f'https://shop.orange.eg/en/catalog/ListCategoryProducts?IsMobile=false&pagenumber={x}&categoryId=24'
    r = s.get(url)
    soup = bs(r.text, 'html.parser')
    devices = soup.select('[class^="card device-card"]')
    for d in devices:
        product_title = d.select_one('h4[class^="card-title"] a[name="ancProduct"]').get('title')
        product_price = d.select_one('h4[class^="card-title"] a[name="ancProduct"]').get('data-gtm-click-price')
        product_link = d.select_one('h4[class^="card-title"] a[name="ancProduct"]').get('href')
        big_list.append((product_title, product_price, product_link))
df = pd.DataFrame(big_list, columns=['Product', 'Price', 'Url'])
print(df)
Result:
Product Price Url
0 Samsung Galaxy Z Fold4 5G 46690.0000 //shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-z-fold4-5g
1 ASUS Vivobook Flip 14 9999.0000 //shop.orange.eg/en/devices/tablets-and-laptops/asus-vivobook-flip-14
2 Acer Aspire 3 A315-56 7299.0000 //shop.orange.eg/en/devices/tablets-and-laptops/acer-aspire-3-a315-56
3 Lenovo IdeaPad 3 15IGL05 5777.0000 //shop.orange.eg/en/devices/tablets-and-laptops/lenovo-tablets/lenovo-ideapad-3-15igl05
4 Lenovo IdeaPad Flex 5 16199.0000 //shop.orange.eg/en/devices/tablets-and-laptops/lenovo-tablets/lenovo-ideapad-flex-5
... ... ... ...
171 Eufy P1 Scale Wireless Smart Digital 699.0000 //shop.orange.eg/en/devices/accessories/scale-wireless/eufy-p1-scale-wireless-smart-digital
172 Samsung Smart TV 50AU7000 9225.0000 //shop.orange.eg/en/devices/smart-tv/samsung-tv-50tu7000
173 Samsung Smart TV 43T5300 6999.0000 //shop.orange.eg/en/devices/smart-tv/samsung-tv-43t5300
174 Samsung Galaxy A22 4460.0000 //shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-a22
175 Eufy eufycam 2 2 plus 1 kit 4999.0000 //shop.orange.eg/en/devices/accessories/camera-wireless/eufy-eufycam-2-2-plus-1-kit
176 rows × 3 columns
For TQDM visit https://pypi.org/project/tqdm/
For Requests documentation, see https://requests.readthedocs.io/en/latest/
Also for pandas: https://pandas.pydata.org/pandas-docs/stable/index.html
And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
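Note the page count (1 to 15) is hard-coded above; if the catalogue grows, one option is to keep requesting pages until the endpoint returns no more device cards. A minimal sketch, assuming an out-of-range page number simply comes back without any cards:

import requests
from bs4 import BeautifulSoup

headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
s = requests.Session()
s.headers.update(headers)
titles = []
page = 1
while True:
    url = f'https://shop.orange.eg/en/catalog/ListCategoryProducts?IsMobile=false&pagenumber={page}&categoryId=24'
    soup = BeautifulSoup(s.get(url).text, 'html.parser')
    devices = soup.select('[class^="card device-card"]')
    if not devices:  # assumption: an empty page means we've paged past the end
        break
    titles += [d.select_one('h4[class^="card-title"] a[name="ancProduct"]').get('title') for d in devices]
    page += 1
print(len(titles), 'products found')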
The webpage is loaded dynamically from an external source via AJAX, so you have to use the API URL instead.
Example:
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}
ajax_url = 'https://shop.orange.eg/en/catalog/ListCategoryProducts'
params = {
    'IsMobile': 'false',
    'pagenumber': '2',
    'categoryId': '24'
}
for params['pagenumber'] in range(1, 2):
    web_page = requests.get(ajax_url, headers=headers, params=params)
    time.sleep(5)
    soup = BeautifulSoup(web_page.content, "html.parser")
    product_list = soup.find_all('div', class_='col-md-6 col-lg-4 mb-4')
    for product in product_list:
        title = product.h4.get_text(strip=True)
        print(title)
Output:
Samsung MobilesSamsung Galaxy Z Fold4 5G
Tablets and LaptopsASUS Vivobook Flip 14
Tablets and LaptopsAcer Aspire 3 A315-56
Lenovo TabletsLenovo IdeaPad 3 15IGL05
Lenovo TabletsLenovo IdeaPad Flex 5
Samsung MobilesSamsung Galaxy S22 Ultra 5G
WearablesApple Watch Series 7
Samsung MobilesSamsung Galaxy Note 20 Ultra
GamingLenovo IdeaPad Gaming 3
Tablets and LaptopsSamsung Galaxy Tab S8 5G
Wireless ChargerLanex Charger Wireless Magnetic 3-in-1 15W
AccessoriesAnker Sound core R100
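A side note on that output: the category and product name run together ("Samsung MobilesSamsung Galaxy Z Fold4 5G") because get_text(strip=True) joins the tag's child strings with no separator. Passing a separator as the first argument keeps them apart - a small self-contained demo (the exact markup inside the h4 is an assumption):

from bs4 import BeautifulSoup

html = '<h4><span>Samsung Mobiles</span><a>Samsung Galaxy Z Fold4 5G</a></h4>'
h4 = BeautifulSoup(html, 'html.parser').h4
print(h4.get_text(strip=True))         # Samsung MobilesSamsung Galaxy Z Fold4 5G
print(h4.get_text(' | ', strip=True))  # Samsung Mobiles | Samsung Galaxy Z Fold4 5G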
I'm trying to grab just the Symbol column excluding any symbols with a period in them (website here). So far I can scrape the page and get the entire html into a variable, however I am struggling to extract just the symbols with no period. Here's the code:
import bs4
import requests
import re
url = 'https://stockcharts.com/def/servlet/SC.scan?s=I.Y|TSAL[t.t_eq_s]![as0,20,tv_gt_40000]![wr_eq_1]&report=predefalli'  # base url to get the pages count
requests.sessions = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
response = requests.get(url, headers=requests.sessions)
soup = bs4.BeautifulSoup(response.text, 'lxml')
for item in soup.find_all(attrs={'class': 'icon icon-square icon-scc-pos-sq-sharp'}):
    for link in item.find_all('a'):
        print(link.get('href'))
Try:
import requests
from io import StringIO
import pandas as pd
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}
url = "https://stockcharts.com/def/servlet/SC.scan?s=I.Y%7CTSAL%5Bt.t_eq_s%5D!%5Bas0,20,tv_gt_40000%5D!%5Bwr_eq_1%5D&report=predefalli"
t = requests.get(url, headers=headers).text
df = pd.read_html(StringIO(t))[0]
# filter-out symbols with dot (.)
df.pop("Unnamed: 0")
print(df[~df["Symbol"].str.contains(r"\.")])
Prints:
Symbol Name Exchange Sector Industry SCTR U Close Volume
1 AB AllianceBernstein Holding LP NYSE Financial Asset Managers 92.9 mid 49.918 111285
2 ABG Asbury Automotive Group Inc. NYSE Consumer Discretionary Specialty Retailers 75.8 mid 196.210 47977
3 ABR Arbor Realty Trust Inc. NYSE Real Estate Mortgage REITs 79.9 mid 18.400 456439
4 ACA Arcosa, Inc. NYSE Industrial Commercial Vehicles 15.3 mid 51.980 52470
6 ACH Aluminum Corp. of China NYSE Materials Aluminum 96.8 mid 14.830 22646
...
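If the backslash escape feels fragile, str.contains can also treat the dot as a literal substring with regex=False - a tiny standalone example:

import pandas as pd

df = pd.DataFrame({"Symbol": ["AB", "BRK.B", "ACH"]})
# regex=False makes "." a literal character instead of the regex "any character"
print(df[~df["Symbol"].str.contains(".", regex=False)])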
I've built a simple web scraper below that scrapes some information from the site https://www.thewhiskyexchange.com/new-products/standard-whisky every minute or so.
It had been working fine up until today, when it suddenly stopped. Changing the selector to
for product in soup.select('a'):
prints out:
[Chrome Web Store, Cloudflare]
Could this be an authentication issue caused by Cloudflare? Is there a way around this?
Full code:
import ssl
import requests
import sys
import time
import smtplib
from email.message import EmailMessage
import hashlib
from urllib.request import urlopen
from datetime import datetime
import json
import random
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
user_agent_list = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
for i in range(1, 4):
    # Pick a random user agent
    user_agent = random.choice(user_agent_list)
    # Set the headers
    headers = {'User-Agent': user_agent}
    url = []
    url = 'https://www.thewhiskyexchange.com/new-products/standard-whisky/'
    response = requests.get(url, headers=headers)
    bottles = []
    link = []
    product_name_old = []
    link2 = []
    link3 = []
    soup = BeautifulSoup(response.text, features="html.parser")
    oldlinks = []
    product_name_old = []
    for product in soup.select('li.product-grid__item'):
        product_name_old.append(product.a.attrs['title'])
        oldlinks.append(product.a.attrs['href'])
    product_size_old = len(product_name_old)
    print("Setup Complete", product_size_old)
    link4 = "\n".join("{}\nhttps://www.thewhiskyexchange.com{}".format(x, y) for x, y in zip(product_name_old, oldlinks))
    print(link4)
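Quite possibly. A quick way to confirm is to inspect what requests actually received; Cloudflare challenge pages usually come back with a 403 or 503 status and mention Cloudflare in the HTML. A rough check (the marker string is an assumption - print r.text and look for yourself):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
r = requests.get('https://www.thewhiskyexchange.com/new-products/standard-whisky/', headers=headers)

# a challenge page typically pairs a 403/503 status with Cloudflare markers in the body
print(r.status_code)
print('cloudflare' in r.text.lower())

If it is a challenge page, the code below, which switches to httpx (with a browser User-Agent) under trio, still returned the full product grid: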
import trio
import httpx
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
}

async def main(url):
    async with httpx.AsyncClient(timeout=None) as client:
        client.headers.update(headers)
        r = await client.get(url)
        soup = BeautifulSoup(r.text, 'lxml')
        goal = [(x['title'].strip(), url[:33] + x['href'])
                for x in soup.select('.product-card')]
        df = pd.DataFrame(goal, columns=['Title', 'Link'])
        print(df)

if __name__ == "__main__":
    trio.run(main, 'https://www.thewhiskyexchange.com/new-products/standard-whisky/')
Output:
Title Link
0 Macallan 18 Year Old Sherry Oak 2020 Release https://www.thewhiskyexchange.com/p/56447/maca...
1 Benriach The Thirty 30 Year Old https://www.thewhiskyexchange.com/p/60356/benr...
2 Maker's Mark Kentucky Mule Cocktail Kit https://www.thewhiskyexchange.com/p/61132/make...
3 Isle of Raasay Single Malt https://www.thewhiskyexchange.com/p/60558/isle...
4 Caol Ila 2001 19 Year Old Exclusive to The Whi... https://www.thewhiskyexchange.com/p/61099/caol...
.. ... ...
75 MB Roland Single Barrel Bourbon https://www.thewhiskyexchange.com/p/60403/mb-r...
76 Seven Seals The Age of Scorpio https://www.thewhiskyexchange.com/p/60373/seve...
77 Seven Seals The Age of Aquarius https://www.thewhiskyexchange.com/p/60372/seve...
78 Langatun 2016 Pedro Ximenez Sherry Cask Finish https://www.thewhiskyexchange.com/p/60371/lang...
79 Speyburn 2009 11 Year Old Sherry Cask Connoiss... https://www.thewhiskyexchange.com/p/60411/spey...
[80 rows x 2 columns]
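To extend that to several listing pages at once, httpx's async client pairs naturally with a trio nursery. A sketch, assuming the site paginates with a ?pg=N query parameter (check the real parameter name in your browser first):

import trio
import httpx
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0'
}
base = 'https://www.thewhiskyexchange.com/new-products/standard-whisky/'
results = []

async def fetch(client, url):
    r = await client.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    # url[:33] is 'https://www.thewhiskyexchange.com', the scheme plus host
    results.extend((x['title'].strip(), url[:33] + x['href'])
                   for x in soup.select('.product-card'))

async def main(urls):
    async with httpx.AsyncClient(timeout=None, headers=headers) as client:
        async with trio.open_nursery() as nursery:
            for url in urls:
                nursery.start_soon(fetch, client, url)  # fetch pages concurrently

trio.run(main, [f'{base}?pg={n}' for n in range(1, 4)])  # ?pg= is an assumption
print(len(results), 'products')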
I'm having a really hard time trying to scrape Amazon; my code works on every average page, but when it comes to Amazon it's really frustrating.
I know I can use find_all, but I'm using this approach to "keep the flow" and get the text and img alt values simultaneously - see
Multiple conditions in BeautifulSoup: Text=True & IMG Alt=True
This is my code:
url = "https://www.amazon.com/Best-Sellers-Health-Personal-Care-Foot-Arch-Supports/zgbs/hpc/3780091"
import requests
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
r = requests.get(url, headers=headers)
from bs4 import BeautifulSoup, Tag, NavigableString
soup = BeautifulSoup(r.content, 'html.parser')
def get_raw_text(s):
    for t in s.contents:
        if isinstance(t, Tag):
            if t.name == 'img' and 'alt' in t.attrs:
                yield t['alt']
            yield from get_raw_text(t)

for text in get_raw_text(soup):
    print(text)
and I get nothing.
Try changing the HTML parser to html5lib. First run pip install html5lib, then try again with this:
import requests
from bs4 import BeautifulSoup, Tag
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
}
url = "https://www.amazon.com/Best-Sellers-Health-Personal-Care-Foot-Arch-Supports/zgbs/hpc/3780091"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html5lib')

def get_raw_text(s):
    for t in s.contents:
        if isinstance(t, Tag):
            if t.name == 'img' and 'alt' in t.attrs:
                yield t['alt']
            yield from get_raw_text(t)

for text in get_raw_text(soup):
    print(text)
Output:
Dr. Scholl’s Tri-Comfort Insoles // Comfort for Heel, Arch and Ball of Foot with Targeted Cushioning and Arch Support…
Dr. Scholl’s Sport Insoles // Superior Shock Absorption and Arch Support to Reduce Muscle Fatigue and Stress on Lower…
Copper Compression Copper Arch Support - 2 Plantar Fasciitis Braces/Sleeves. Guaranteed Highest Copper Content. Foot…
Dr. Scholl’s Extra Support Insoles // Superior Shock Absorption and Reinforced Arch Support for Big & Tall Men To Reduce…
Arch Support,3 Pairs Compression Fasciitis Cushioned Support Sleeves, Plantar Fasciitis Foot Relief Cushions for Plantar…
LLSOARSS Plantar Fasciitis Feet Sandal with Arch Support - Best Orthotic flip Flops for Flat Feet,Heel Pain- for Women
Pcssole’s 3/4 Orthotics Shoe Insoles High Arch Supports Shoe Insoles for Plantar Fasciitis, Flat Feet, Over-Pronation…
and so on...
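For comparison, if only the image alt texts are needed (without interleaving them with other page text), BeautifulSoup's find_all can do it directly - a short sketch using the same page and headers:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
}
url = "https://www.amazon.com/Best-Sellers-Health-Personal-Care-Foot-Arch-Supports/zgbs/hpc/3780091"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html5lib')

# alt=True matches only <img> tags that actually carry an alt attribute
for img in soup.find_all('img', alt=True):
    print(img['alt'])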
I am trying to get the response of
https://us.vestiairecollective.com/members/profile-2241096.shtml#currentpgn=2
but when visiting it with requests' get method I get the response of
https://us.vestiairecollective.com/members/profile-2241096.shtml#currentpgn=0
instead. I'm facing this problem for every value of the currentpgn parameter.
from requests import get
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
resp = get("https://us.vestiairecollective.com/members/profile-2241096.shtml#currentpgn=2", headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
divs = soup.find('div', class_='catalog-list medium').find_all("div", recursive=False)
for div in divs:
    print(div.a['href'])
The part after # is a URL fragment, which is handled client-side and never sent to the server; the data is loaded via AJAX calls instead. But you can replicate those calls with the requests module. For example (this prints the first 10 pages of product brands, names and prices):
import re
import requests
from bs4 import BeautifulSoup
url = 'https://us.vestiairecollective.com/members/profile-2241096.shtml#currentpgn=2'
ajax_url = 'https://us.vestiairecollective.com/profil.shtml'
id_profil = re.search(r'-(\d+)\.', url).group(1)
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
data = {'ajx':'1', 'limit':'60', 'step':'2', 'ajax_var':'sell', 'id_profil':id_profil, 'filterSold':'1'}
for page in range(0, 10):
    print('Processing page {}...'.format(page))
    print('-' * 80)
    data['step'] = page
    soup = BeautifulSoup(requests.post(ajax_url, data=data, headers=headers).json()['result'], 'html.parser')
    for brand, name, price in zip(soup.select('.productItem .brand'),
                                  soup.select('.productItem .name'),
                                  soup.select('.productItem .price')):
        print(brand.text, name.text, price.text)
Prints:
Processing page 0...
--------------------------------------------------------------------------------
LOUIS VUITTON Patent leather flip flops 226 €
JEAN PAUL GAULTIER Jacket 150 €
MIU MIU Leather hair accessory 299 €
CHANEL Belt 696 €
ETRO Hair accessory 149 €
DOLCE & GABBANA Mink bag charm 285 €
CHANEL Leather heels 636 €
CHANEL Silk shirt 490 €
VALENTINO GARAVANI Patent leather wallet 199 €
... and so on.
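The fixed range(0, 10) assumes the seller has at least ten pages; a variant that pages until the endpoint returns no products, assuming an out-of-range step comes back empty (not verified against the API):

import re
import requests
from bs4 import BeautifulSoup

url = 'https://us.vestiairecollective.com/members/profile-2241096.shtml#currentpgn=2'
ajax_url = 'https://us.vestiairecollective.com/profil.shtml'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
data = {'ajx': '1', 'limit': '60', 'step': '0', 'ajax_var': 'sell',
        'id_profil': re.search(r'-(\d+)\.', url).group(1), 'filterSold': '1'}

page = 0
while True:
    data['step'] = page
    html = requests.post(ajax_url, data=data, headers=headers).json()['result']
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.select('.productItem')
    if not items:  # assumption: paging past the last page yields an empty result
        break
    for item in items:
        print(item.select_one('.brand').text, item.select_one('.price').text)
    page += 1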