Getting an unwanted response from Python requests


I am trying to get the response for
https://us.vestiairecollective.com/members/profile-2241096.shtml#currentpgn=2
but when I request it with the requests get method, I receive the response for
https://us.vestiairecollective.com/members/profile-2241096.shtml#currentpgn=0
I face this problem for every value of currentpgn.
from requests import get
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
resp = get("https://us.vestiairecollective.com/members/profile-2241096.shtml#currentpgn=2", headers=headers)
soup = BeautifulSoup(resp.text, 'lxml')
divs = soup.find('div', class_='catalog-list medium').find_all("div", recursive=False)
for div in divs:
    print(div.a['href'])

The data is loaded via Ajax calls, but you can replicate those calls with the requests module. For example (this prints the brands, names and prices of the products on the first 10 pages):
import re
import requests
from bs4 import BeautifulSoup
url = 'https://us.vestiairecollective.com/members/profile-2241096.shtml#currentpgn=2'
ajax_url = 'https://us.vestiairecollective.com/profil.shtml'
id_profil = re.search(r'-(\d+)\.', url).group(1)
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
data = {'ajx':'1', 'limit':'60', 'step':'2', 'ajax_var':'sell', 'id_profil':id_profil, 'filterSold':'1'}
for page in range(0, 10):
    print('Processing page {}...'.format(page))
    print('-' * 80)
    data['step'] = page
    soup = BeautifulSoup(requests.post(ajax_url, data=data, headers=headers).json()['result'], 'html.parser')
    for brand, name, price in zip(soup.select('.productItem .brand'),
                                  soup.select('.productItem .name'),
                                  soup.select('.productItem .price')):
        print(brand.text, name.text, price.text)
Prints:
Processing page 0...
--------------------------------------------------------------------------------
LOUIS VUITTON Patent leather flip flops 226 €
JEAN PAUL GAULTIER Jacket 150 €
MIU MIU Leather hair accessory 299 €
CHANEL Belt 696 €
ETRO Hair accessory 149 €
DOLCE & GABBANA Mink bag charm 285 €
CHANEL Leather heels 636 €
CHANEL Silk shirt 490 €
VALENTINO GARAVANI Patent leather wallet 199 €
... and so on.
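As background on why the original request always came back as page 0: everything after # in a URL is a fragment, which HTTP clients such as requests do not send to the server. So #currentpgn=2 never reaches the site, and the page number is only applied by JavaScript in the browser. A minimal sketch using only the standard library (the variable names are illustrative):

```python
from urllib.parse import urldefrag

url = "https://us.vestiairecollective.com/members/profile-2241096.shtml#currentpgn=2"

# urldefrag splits a URL into the part that is sent to the server and the fragment.
requested_url, fragment = urldefrag(url)

print(requested_url)  # the fragment-free URL that actually goes over the wire
print(fragment)       # the fragment, which only client-side JavaScript sees
```

This is why changing the number after currentpgn= makes no difference to what requests receives.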

Related

Webscrape e-commerce

I'm a beginner at web scraping with Python, but I need to use it frequently.
I'm trying to scrape an e-shop for mobile phones to get each item's name and price.
Website: https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false
My code, which uses the User-Agent technique, is below:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}
web_page = requests.get(url,headers=headers)
soup = BeautifulSoup(web_page.content, "html.parser")
product_list = soup.find_all('div', class_='col-md-6 col-lg-4 mb-4')
product_list
Output: [] (an empty list)
I'm not sure I'm doing this right. Also, when I look at the page source, I can't find the product information.

That page is loaded initially and then hydrated from an API (which returns HTML).
This is one way to get the products sold by Orange Egypt:
from bs4 import BeautifulSoup as bs
import requests
from tqdm import tqdm ## if using jupyter notebook, import as: from tqdm.notebook import tqdm
import pandas as pd
headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
s = requests.Session()
s.headers.update(headers)
big_list = []
for x in tqdm(range(1, 16)):
    url = f'https://shop.orange.eg/en/catalog/ListCategoryProducts?IsMobile=false&pagenumber={x}&categoryId=24'
    r = s.get(url)
    soup = bs(r.text, 'html.parser')
    devices = soup.select('[class^="card device-card"]')
    for d in devices:
        # the product anchor carries the title, price and link
        anchor = d.select_one('h4[class^="card-title"] a[name="ancProduct"]')
        product_title = anchor.get('title')
        product_price = anchor.get('data-gtm-click-price')
        product_link = anchor.get('href')
        big_list.append((product_title, product_price, product_link))
df = pd.DataFrame(big_list, columns = ['Product', 'Price', 'Url'])
print(df)
Result:
Product Price Url
0 Samsung Galaxy Z Fold4 5G 46690.0000 //shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-z-fold4-5g
1 ASUS Vivobook Flip 14 9999.0000 //shop.orange.eg/en/devices/tablets-and-laptops/asus-vivobook-flip-14
2 Acer Aspire 3 A315-56 7299.0000 //shop.orange.eg/en/devices/tablets-and-laptops/acer-aspire-3-a315-56
3 Lenovo IdeaPad 3 15IGL05 5777.0000 //shop.orange.eg/en/devices/tablets-and-laptops/lenovo-tablets/lenovo-ideapad-3-15igl05
4 Lenovo IdeaPad Flex 5 16199.0000 //shop.orange.eg/en/devices/tablets-and-laptops/lenovo-tablets/lenovo-ideapad-flex-5
... ... ... ...
171 Eufy P1 Scale Wireless Smart Digital 699.0000 //shop.orange.eg/en/devices/accessories/scale-wireless/eufy-p1-scale-wireless-smart-digital
172 Samsung Smart TV 50AU7000 9225.0000 //shop.orange.eg/en/devices/smart-tv/samsung-tv-50tu7000
173 Samsung Smart TV 43T5300 6999.0000 //shop.orange.eg/en/devices/smart-tv/samsung-tv-43t5300
174 Samsung Galaxy A22 4460.0000 //shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-a22
175 Eufy eufycam 2 2 plus 1 kit 4999.0000 //shop.orange.eg/en/devices/accessories/camera-wireless/eufy-eufycam-2-2-plus-1-kit
176 rows × 3 columns
For TQDM visit https://pypi.org/project/tqdm/
For Requests documentation, see https://requests.readthedocs.io/en/latest/
Also for pandas: https://pandas.pydata.org/pandas-docs/stable/index.html
And for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
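The Url column above holds protocol-relative links (they start with //), which are not directly usable outside a browser. A small sketch of turning one into an absolute URL with the standard library, using the page URL from the question as the base; the variable names are illustrative:

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
base_url = "https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false"

# One of the protocol-relative links from the result above.
relative_link = "//shop.orange.eg/en/mobiles/samsung-mobiles/samsung-galaxy-z-fold4-5g"

# urljoin borrows the scheme of the base URL for links that start with //.
absolute_link = urljoin(base_url, relative_link)
print(absolute_link)
```

The same call could be applied to every value of product_link before it is appended to big_list.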

The page is loaded dynamically from an external source via AJAX, so you have to request the API URL instead.
Example:
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://shop.orange.eg/en/mobiles-and-devices?IsMobile=false'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"}
ajax_url = 'https://shop.orange.eg/en/catalog/ListCategoryProducts'
params = {
    'IsMobile':'false',
    'pagenumber': '2',
    'categoryId': '24'
}
# range(1, 2) fetches only page 1; widen the range to fetch more pages
for params['pagenumber'] in range(1,2):
    web_page = requests.get(ajax_url,headers=headers,params=params)
    time.sleep(5)
    soup = BeautifulSoup(web_page.content, "html.parser")
    product_list = soup.find_all('div', class_='col-md-6 col-lg-4 mb-4')
    for product in product_list:
        title=product.h4.get_text(strip=True)
        print(title)
Output:
Samsung MobilesSamsung Galaxy Z Fold4 5G
Tablets and LaptopsASUS Vivobook Flip 14
Tablets and LaptopsAcer Aspire 3 A315-56
Lenovo TabletsLenovo IdeaPad 3 15IGL05
Lenovo TabletsLenovo IdeaPad Flex 5
Samsung MobilesSamsung Galaxy S22 Ultra 5G
WearablesApple Watch Series 7
Samsung MobilesSamsung Galaxy Note 20 Ultra
GamingLenovo IdeaPad Gaming 3
Tablets and LaptopsSamsung Galaxy Tab S8 5G
Wireless ChargerLanex Charger Wireless Magnetic 3-in-1 15W
AccessoriesAnker Sound core R100

Scraping returning None

I am trying to scrape Yellow Pages. Everything works fine except scraping the phone numbers! They are in a div with class 'popover-phones' that contains an a tag whose href is the phone number. Can anyone assist me, please?
import requests
from bs4 import BeautifulSoup
import json
from csv import writer
url = 'https://yellowpages.com.eg/en/category/charcoal'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
articles = soup.find_all('div', class_= 'col-xs-12 item-details')
for item in articles:
    address = item.find('a',class_= 'address-text').text
    company = item.find('a',class_= 'item-title').text
    telephone = item.find('div', class_='popover-phones')
    print(company,address,telephone)

The phone numbers you see are loaded from an external URL. To get all phone numbers from the page, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://yellowpages.com.eg/en/category/charcoal"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for p in soup.select("[data-tooltip-phones]"):
    phone_url = "https://yellowpages.com.eg" + p["data-tooltip-phones"]
    title = p.find_previous(class_="item-title").text
    phones = requests.get(phone_url).json()
    print(title, *[b for a in phones for b in a])
Prints:
2 Bacco 02-3390-8764
3 A Group International 0120-3530-005 057-2428-449
3 A Group International 0120-3833-500 0120-3530-005
Abdel Karim 0122-3507-461
Abdel Sabour Zidan 03-4864-641
Abou Aoday 0111-9226-536 0100-3958-351
Abou Eid For Charcoal Trading 0110-0494-770
Abou Fares For Charcoal Trade 0128-3380-916
Abou Karim Store 0100-6406-939
Adel Sons 0112-1034-398 0115-0980-776
Afandina 0121-2414-087
Ahmed El Fahham 02-2656-0815
Al Baraka For Charcoal 0114-6157-799 0109-3325-720
Al Ghader For Import & Export 03-5919-355 0111-0162-602 0120-6868-434
Al Mashd For Coal 0101-0013-743 0101-0013-743
Al Zahraa Co. For Exporting Charcoal & Agriculture Products 040-3271-056 0100-0005-174 040-3271-056
Alex Carbon Group 03-3935-902
Alwaha Charcoal Trade Est. 0100-4472-554 0110-1010-810 0100-9210-812
Aly Abdel Rahman For Charcoal Trade 03-4804-440 0122-8220-661
Amy Deluxe Egypt 0112-5444-410
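The output above repeats some numbers (for example 0101-0013-743). A minimal sketch of flattening the nested list and dropping repeats while keeping the original order. It assumes the JSON has the list-of-lists shape that the comprehension in the answer implies; the sample data and the helper name flatten_unique are illustrative, not taken from the site:

```python
def flatten_unique(nested):
    """Flatten a list of lists and drop duplicates, preserving first-seen order."""
    flat = [number for group in nested for number in group]
    # dict.fromkeys keeps insertion order, so it works as an ordered set.
    return list(dict.fromkeys(flat))


# Hypothetical response shaped like the one the answer iterates over.
phones = [["0101-0013-743"], ["0101-0013-743"]]
print(*flatten_unique(phones))
```

In the answer's loop, print(title, *flatten_unique(phones)) would then replace the inline comprehension.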

Pagination not iterating over pages

I want to iterate over all pages of https://www.iata.org/en/about/members/airline-list/ and dump the results into a .csv file.
How can I add code that iterates through the pages to the current code below?
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request
url = 'https://www.iata.org/en/about/members/airline-list/'
req = Request(url , headers = {
    'accept':'*/*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'})
data = []
while True:
    print(url)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    data.append(pd.read_html(soup.select_one('table.datatable').prettify())[0])
    if soup.select_one('span.pagination-link.is-active + div a[href]'):
        url = soup.select_one('span.pagination-link.is-active + div a')['href']
    else:
        break
df = pd.concat(data)
df.to_csv('airline-list.csv',encoding='utf-8-sig',index=False)

Try this approach:
for i in range(1, 30):
    url = f'https://www.iata.org/en/about/members/airline-list/?page={i}&search=&ordering=Alphabetical'
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    data.append(pd.read_html(soup.select_one('table.datatable').prettify())[0])
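The hard-coded range(1, 30) above assumes the number of pages is known in advance. A sketch of an alternative that keeps requesting pages until one comes back empty. Here fetch_rows is a hypothetical stand-in for the requests and read_html calls, backed by fake in-memory data so the loop can run offline; whether the real site returns an empty table past the last page is an assumption that would need checking:

```python
from typing import Callable, List


def collect_all_pages(fetch_rows: Callable[[int], List[str]], max_pages: int = 1000) -> List[str]:
    """Call fetch_rows(1), fetch_rows(2), ... until a page has no rows."""
    rows: List[str] = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_rows(page)
        if not page_rows:  # an empty page means we ran past the last one
            break
        rows.extend(page_rows)
    return rows


# Fake data standing in for the site: three pages, then nothing.
fake_site = {1: ["ABX Air", "Aegean Airlines"], 2: ["Aer Lingus"], 3: ["Aeroflot"]}
all_rows = collect_all_pages(lambda page: fake_site.get(page, []))
print(all_rows)
```

The max_pages cap is there only as a safety net in case the stop condition never triggers.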

To determine the number of pages dynamically, use:
import pandas as pd
import requests
import bs4
url = 'https://www.iata.org/en/about/members/airline-list/?page={page}&search=&ordering=Alphabetical'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
# Total number of pages
html = requests.get(url.format(page=1), headers=headers)
soup = bs4.BeautifulSoup(html.text, 'html.parser')
pages = int(soup.find_all('a', {'class': 'pagination-link'})[-2].text)
data = []
for page in range(1, pages+1):
    # headers must be passed to requests.get, not to str.format
    html = requests.get(url.format(page=page), headers=headers)
    data.append(pd.read_html(html.text)[0])
df = pd.concat(data)
Output:
>>> df
Airline Name IATA Designator 3 digit code ICAO code Country / Territory
0 ABX Air GB 832 ABX United States
1 Aegean Airlines A3 390 AEE Greece
2 Aer Lingus EI 53 EIN Ireland
3 Aero Republica P5 845 RPB Colombia
4 Aeroflot SU 555 AFL Russian Federation
.. ... ... ... ... ...
3 WestJet WS 838 WJA Canada
4 White coloured by you WI 97 WHT Portugal
5 Wideroe WF 701 WIF Norway
6 Xiamen Airlines MF 731 CXA China (People's Republic of)
7 YTO Cargo Airlines YG 860 HYT China (People's Republic of)
[288 rows x 5 columns]

Scraping data in columns with no period

I'm trying to grab just the Symbol column, excluding any symbols with a period in them (website here). So far I can scrape the page and get the entire HTML into a variable, but I am struggling to extract just the symbols with no period. Here's the code:
import bs4
import requests
import re
url = 'https://stockcharts.com/def/servlet/SC.scan?s=I.Y|TSAL[t.t_eq_s]![as0,20,tv_gt_40000]![wr_eq_1]&report=predefalli' #base url to get the pages count
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = bs4.BeautifulSoup(response.text,'lxml')
for item in soup.find_all(attrs={'class': 'icon icon-square icon-scc-pos-sq-sharp'}):
    for link in item.find_all('a'):
        print(link.get('href'))

Try:
import requests
from io import StringIO
import pandas as pd
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0"
}
url = "https://stockcharts.com/def/servlet/SC.scan?s=I.Y%7CTSAL%5Bt.t_eq_s%5D!%5Bas0,20,tv_gt_40000%5D!%5Bwr_eq_1%5D&report=predefalli"
t = requests.get(url, headers=headers).text
df = pd.read_html(StringIO(t))[0]
# drop the unnamed first column
df.pop("Unnamed: 0")
# filter out symbols containing a dot (.)
print(df[~df["Symbol"].str.contains(r"\.")])
Prints:
Symbol Name Exchange Sector Industry SCTR U Close Volume
1 AB AllianceBernstein Holding LP NYSE Financial Asset Managers 92.9 mid 49.918 111285
2 ABG Asbury Automotive Group Inc. NYSE Consumer Discretionary Specialty Retailers 75.8 mid 196.210 47977
3 ABR Arbor Realty Trust Inc. NYSE Real Estate Mortgage REITs 79.9 mid 18.400 456439
4 ACA Arcosa, Inc. NYSE Industrial Commercial Vehicles 15.3 mid 51.980 52470
6 ACH Aluminum Corp. of China NYSE Materials Aluminum 96.8 mid 14.830 22646
...
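The backslash in r"\." matters because str.contains treats its pattern as a regular expression by default, where a bare dot matches any character. A sketch of the same filter in plain Python with the re module; the dotted symbols in the list are made-up examples, not taken from the scan:

```python
import re

symbols = ["AB", "ABG", "BRK.B", "BF.A"]  # last two are hypothetical dotted symbols

# An unescaped dot matches any character, so every non-empty symbol "contains" it.
matches_any_char = [s for s in symbols if re.search(r".", s)]

# An escaped dot matches only a literal period.
without_period = [s for s in symbols if not re.search(r"\.", s)]

print(matches_any_char)
print(without_period)
```

With an unescaped dot, the negated filter in the answer would therefore drop every row instead of only the dotted symbols.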

Web Scraping returns None

I'm trying to get the prices of a list of monitors from Amazon, using requests and bs4.
Here is the code:
from bs4 import BeautifulSoup
import re
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',}
res = requests.get("https://www.amazon.com/s?i=specialty-aps&bbn=16225007011&rh=n%3A16225007011%2Cn%3A1292115011&ref=nav_em__nav_desktop_sa_intl_monitors_0_2_6_8", headers=headers)
print(res)
soup = BeautifulSoup(res.text, "html.parser")
price=soup.find_all(class_="a-price-whole")
print(price.text)
I don't understand why it returns None - I'm basically following a video, https://www.youtube.com/watch?v=Bg9r_yLk7VY&t=467s&ab_channel=DevEd, and on their side it returns the text - can someone point out what I'm doing wrong?

You've probably received a captcha page. Try adding the "Accept-Language" HTTP header:
import re
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.5",
}
res = requests.get(
    "https://www.amazon.com/s?i=specialty-aps&bbn=16225007011&rh=n%3A16225007011%2Cn%3A1292115011&ref=nav_em__nav_desktop_sa_intl_monitors_0_2_6_8",
    headers=headers,
)
soup = BeautifulSoup(res.text, "html.parser")
prices = soup.find_all(class_="a-price-whole")
for price in prices:
    print(
        price.find_previous("h2").text[:30] + "...",
        price.text + price.find_next(class_="a-price-fraction").text,
    )
Prints:
Sceptre IPS 27-Inch Business C... 159.17
EVICIV 12.3’’ Raspberry Pi Tou... 199.99
Portable Monitor, 17.3'' IPS H... 349.99
Acer R240HY bidx 23.8-Inch IPS... 129.99
Dell SE2419Hx 24" IPS Full HD ... 169.95
HP Pavilion 22cwa 21.5-Inch Fu... 139.99
Sceptre E248W-19203R 24" Ultra... 127.98
LG 27GL83A-B 27 Inch Ultragear... 379.99
LG 24M47VQ 24-Inch LED-lit Mon... 99.99
LG 27UN850-W 27 Inch Ultrafine... 404.14
Sceptre IPS 24-Inch Business C... 142.17
Planar PXN2400 Full HD Thin Pr... 139.00
Sceptre IPS 24-Inch Business C... 142.17
Portable Triple Screen Laptop ... 419.99
ASUS ZenScreen 15.6" 1080P Por... 232.52
HP M27ha FHD Monitor - Full HD... 199.99
ASUS 24" 1080P Gaming Monitor ... 189.99
Dell P2419H 24 Inch LED-Backli... 187.99
LG 32QN600-B 32-Inch QHD (2560... 249.99
LG 29WN600-W 29" 21:9 UltraWid... 226.99
Acer Nitro XV272U Pbmiiprzx 27... 299.99
AOC C24G1 24" Curved Frameless... 186.99
Samsung CF390 Series 27 inch F... 199.00
ASUS VY279HE 27” Eye Care Moni... 219.00
SAMSUNG LC24F396FHNXZA 23.5" F... 149.99
Sceptre E275W-19203R 27" Ultra... 169.97
ASUS VG245H 24 inchFull HD 108... 164.95
PEPPER JOBS 15.6" USB-C Portab... 199.99
13.3 inch Portable Monitor,KEN... 96.99
Eyoyo Small Monitor 8 inch Min... 76.98
