I am trying to scrape data from Amazon Canada (amazon.ca). I am using the requests and bs4 packages to fetch and parse the HTML, but I am not able to extract the data from the response. Can someone please help me extract the information from the response?
import requests
from bs4 import BeautifulSoup
# Define headers
headers={
'content-type': 'text/html;charset=UTF-8',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
}
# Amazon Canada product url
url = 'https://www.amazon.ca/INIU-High-Speed-Flashlight-Powerbank-Compatible/dp/B07CZDXDG8?ref_=Oct_s9_apbd_otopr_hd_bw_b3giFrP&pf_rd_r=69GE1K9DG49351YHSYBC&pf_rd_p=694b8fdf-0d96-57ba-b834-dc9bdeb7a094&pf_rd_s=merchandised-search-11&pf_rd_t=BROWSE&pf_rd_i=3379552011&th=1'
resp = requests.get(url, headers=headers)
print(resp)
<Response [200]>
Earlier it was returning <Response [503]>, so I added headers, and now it returns <Response [200]>. Next, I try to extract some information from the page.
# Using html parser
soup = BeautifulSoup(resp.content,'lxml')
# Extracting information from page
product_title = soup.find('span',id='productTitle')
print('product_title -' ,product_title)
product_price = soup.find('span',id='priceblock_ourprice')
print('product_price -' ,product_price)
('product_title -', None)
('product_price -', None)
But it shows None, so I checked what data is actually present in the soup by printing it:
soup.text
'\n\n\n\nRobot Check\n\n\n\n\nif (true === true) {\n var ue_t0 = (+
new Date()),\n ue_csm = window,\n ue = { t0: ue_t0, d:
function() { return (+new Date() - ue_t0); } },\n ue_furl =
"fls-na.amazon.ca",\n ue_mid = "A2EUQ1WTGCTBG2",\n
ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],\n
ue_sn = "opfcaptcha.amazon.ca",\n ue_id =
\'0B2HQATTKET8J6M36Y3G\';\n}\n\n\n\n\n\n\n\n\n\n\n\nEnter the
characters you see below\nSorry, we just need to make sure you\'re not
a robot. For best results, please make sure your browser is accepting
cookies.\n\n\n\n\n\n\n\n\n\n\nType the characters you see in this
image:\n\n\n\n\n\n\n\n\nTry different
image\n\n\n\n\n\n\n\n\n\n\n\nContinue
shopping\n\n\n\n\n\n\n\n\n\n\n\nConditions of Use &
Sale\n\n\n\n\nPrivacy Notice\n\n\n \xa9 1996-2015,
Amazon.com, Inc. or its affiliates\n \n if (true
=== true) {\n document.write(\'<img src="https://fls-na.amaz\'+\'on.ca/\'+\'1/oc-csi/1/OP/requestId=0B2HQATTKET8J6M36Y3G&js=1"
/>\');\n };\n \n\n\n\n\n\n\n if (true === true)
{\n var head = document.getElementsByTagName(\'head\')[0],\n
prefix =
"https://images-na.ssl-images-amazon.com/images/G/01/csminstrumentation/",\n
elem = document.createElement("script");\n elem.src = prefix +
"csm-captcha-instrumentation.min.js";\n
head.appendChild(elem);\n\n elem =
document.createElement("script");\n elem.src = prefix +
"rd-script-6d68177fa6061598e9509dc4b5bdd08d.js";\n
head.appendChild(elem);\n }\n \n\n'
I checked the output thoroughly, but I didn't find any product data in the response; I also checked resp.content and found nothing. I validated the URL as well, and it is valid. I even tested the above script with public proxies, but still got no output.
Can someone please help me extract the information from this URL, or suggest any other way to get it done?
Try this:
import requests
from bs4 import BeautifulSoup
headers = {
'content-type': 'text/html;charset=UTF-8',
'Accept-Encoding': 'gzip, deflate, sdch',
'Accept-Language': 'en-US,en;q=0.8',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}
url = 'https://www.amazon.ca/INIU-High-Speed-Flashlight-Powerbank-Compatible/dp/B07CZDXDG8'
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.content, 'lxml')
# Extracting information from page
print('product_title -', soup.find('span', id='productTitle').text.strip())
print('product_price -', soup.find('span', id='priceblock_ourprice').text.strip())
The code yields:
product_title - INIU Power Bank, Ultra-Slim Dual 3A High-Speed Portable Charger, 10000mAh USB C Input & Flashlight External Phone Battery Pack for iPhone Xs X 8 Plus Samsung S10 Google LG iPad etc. [2020 Upgrade]
product_price - CDN$ 60.66
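Note that Amazon may still serve the robot-check page intermittently, in which case both find() calls return None and .text raises an AttributeError. A small guard (a sketch using the same ids as above) keeps the script from crashing:
title_tag = soup.find('span', id='productTitle')
price_tag = soup.find('span', id='priceblock_ourprice')
if title_tag and price_tag:
    print('product_title -', title_tag.get_text(strip=True))
    print('product_price -', price_tag.get_text(strip=True))
else:
    # Likely the captcha/robot-check page again; retry later or adjust headers
    print('Got the robot check page instead of the product page')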
I want to get email addresses from:
https://thenationalweddingdirectory.com.au/suppliers/wedding-venues/queensland/the-dock-mooloolaba-events/
Right now I have the code below, but how can I scrape the email address from each clicked link?
from requests_html import HTMLSession
url = 'https://thenationalweddingdirectory.com.au/explore/?category=wedding-venues&region=melbourne&sort=top-rated'
s = HTMLSession()
r = s.get(url)
r.html.render(sleep=1)
products = r.html.xpath('//*[@id="finderListings"]/div[2]', first=True)
for item in products.absolute_links:
    r = s.get(item)
    print(r.html.find('li.lmb-calltoaction a', first=True))
The email and telephone are on each page; there is one JSON block with all the info you need.
There is also an "ajax" request you can use to get all the URLs to visit.
import json
from bs4 import BeautifulSoup
import requests
import re
params = {
'mylisting-ajax': '1',
'action': 'get_listings',
'form_data[page]': '0',
'form_data[preserve_page]': 'false',
'form_data[category]': 'wedding-venues',
'form_data[region]': 'melbourne',
'form_data[sort]': 'top-rated',
'listing_type': 'place',
}
response = requests.get('https://thenationalweddingdirectory.com.au/', params=params)
# get all urls
results = re.findall("https://thenationalweddingdirectory.com.au/suppliers/wedding-venues/melbourne/[a-zA-Z-]*/",
response.text.replace("\\", ""))
headers = {
'accept': '*/*',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8,es;q=0.7,ru;q=0.6',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
for result in results:
    print("Navigate: " + result)
    response = requests.get(result, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    scripts = soup.find_all("script")
    for script in scripts:
        if "LocalBusiness" in script.text:
            data = json.loads(script.text)
            print("Name: " + data["name"])
            print("Telephone: " + data["telephone"])
            print("Email: " + data["email"])
            break
OUTPUT:
Navigate: https://thenationalweddingdirectory.com.au/suppliers/wedding-venues/melbourne/metropolis-events/
Name: Metropolis Events
Telephone: 03 8537 7300
Email: info@metropolisevents.com.au
Navigate: https://thenationalweddingdirectory.com.au/suppliers/wedding-venues/melbourne/cotham-dining/
Name: Cotham Dining
Telephone: 0411 931 818
Email: hello@cothamdining.com.au
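If you want to keep the contact details instead of just printing them, a possible extension (a sketch; the venues.csv filename is just an example) collects each listing into a list of dicts and writes it out with the csv module:
import csv

rows = []
for result in results:
    response = requests.get(result, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    for script in soup.find_all("script"):
        if "LocalBusiness" in script.text:
            data = json.loads(script.text)
            rows.append({"name": data["name"],
                         "telephone": data["telephone"],
                         "email": data["email"]})
            break

with open("venues.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "telephone", "email"])
    writer.writeheader()
    writer.writerows(rows)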
I want to get the Verify href from the Gmailnator inbox; the page contains the Discord verify href I am after.
I want to get this href using bs4 and pass it to a Selenium driver with driver.get(url), the url being that href, of course.
Can someone show me code to scrape the href from the Gmailnator inbox? I did try the page source, but it does not contain the href.
This is the code I have written to get the href, but the href I need (the Discord one) is inside a frame source, which I think is why it doesn't come up.
UPDATE! EVERYTHING IS DONE AND FIXED
driver.get('https://www.gmailnator.com/inbox/#for.ev.e.r.my.girlt.m.p@gmail.com')
time.sleep(6)
driver.find_element_by_xpath('//*[@id="mailList"]/tbody/tr[2]/td/a/table/tbody/tr/td[1]').click()
time.sleep(4)
url = driver.current_url
email_for_data = driver.current_url.split('/')[-3]
print(url)
time.sleep(2)
print('Getting Your Discord Verify link')
print('Time To Get Your Discord Link')
soup = BeautifulSoup(requests.get(url).text, "lxml")
data_email = soup.find("")
token = soup.find("meta", {"name": "csrf-token"})["content"]
cf_email = soup.find("a", class_="__cf_email__")["data-cfemail"]
endpoint = "https://www.gmailnator.com/mailbox/get_single_message/"
data = {
"csrf_gmailnator_token": token,
"action": "get_message",
"message_id": url.split("#")[-1],
"email": f"{email_for_data}",
}
headers = {
"referer": f"https://www.gmailnator.com/{email_for_data}/messageid/",
"cookie": f"csrf_gmailnator_cookie={token}; ci_session={cf_email}",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 "
"YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",
"x-requested-with": "XMLHttpRequest",
}
r = requests.post(endpoint, data=data, headers=headers)
the_real_slim_shady = (
BeautifulSoup(r.json()["content"], "lxml")
.find_all("a", {"target": "_blank"})[1]["href"]
)
print(the_real_slim_shady)
You can fake it all with pure requests to get the Verify link. First, you need to get the token and the cf_email values. Then, things are pretty straightforward.
Here's how to get the link:
import requests
from bs4 import BeautifulSoup
url = "https://www.gmailnator.com/geralddoreyestmp/messageid/#179b454b4c482c4d"
soup = BeautifulSoup(requests.get(url).text, "lxml")
token = soup.find("meta", {"name": "csrf-token"})["content"]
cf_email = soup.find("a", class_="__cf_email__")["data-cfemail"]
endpoint = "https://www.gmailnator.com/mailbox/get_single_message/"
data = {
"csrf_gmailnator_token": token,
"action": "get_message",
"message_id": url.split("#")[-1],
"email": "geralddoreyestmp",
}
headers = {
"referer": "https://www.gmailnator.com/geralddoreyestmp/messageid/",
"cookie": f"csrf_gmailnator_cookie={token}; ci_session={cf_email}",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.86 "
"YaBrowser/21.3.0.740 Yowser/2.5 Safari/537.36",
"x-requested-with": "XMLHttpRequest",
}
r = requests.post(endpoint, data=data, headers=headers)
the_real_slim_shady = (
BeautifulSoup(r.json()["content"], "lxml")
.find_all("a", {"target": "_blank"})[1]["href"]
)
print(the_real_slim_shady)
Output (your link will be different!):
https://click.discord.com/ls/click?upn=qDOo8cnwIoKzt0aLL1cBeARJoBrGSa2vu41A5vK-2B4us-3D77CR_3Tswyie9C2vHlXKXm6tJrQwhGg-2FvQ76GD2o0Zl2plCYHULNsKdCuB6s-2BHk1oNirSuR8goxCccVgwsQHdq1YYeGQki4wtPdDA3zi661IJL7H0cOYMH0IJ0t3sgrvr2oMX-2BJBA-2BWZzY42AwgjdQ-2BMAN9Y5ctocPNK-2FUQLxf6HQusMayIeATMiTO-2BlpDytu-2FnIW4axB32RYQpxPGO-2BeHtcSj7a7QeZmqK-2B-2FYkKA4dl5q8I-3D
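Side note: the data-cfemail value grabbed above is Cloudflare's email-obfuscation string. It isn't needed as a plain address for the cookie trick here, but if you ever want to decode one, the usual approach is to XOR each byte with the first byte; a minimal sketch:
def decode_cfemail(cfemail: str) -> str:
    # First hex byte is the XOR key; the remaining bytes are the obfuscated characters
    key = int(cfemail[:2], 16)
    return "".join(
        chr(int(cfemail[i:i + 2], 16) ^ key)
        for i in range(2, len(cfemail), 2)
    )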
I am trying to get the date shown below (01/19/2021), and I would like to get the "19" into a Python variable:
<span class="grayItalic">
Received: 01/19/2021
</span>
Here is the piece of code that isn't working:
date = soup.find('span', {'class': 'grayItalic'}).get_text()
converted_date = int(date[13:14])
print(date)
I get this error: 'NoneType' object has no attribute 'get_text'. Could anyone help?
Try this with headers:
import requests
from bs4 import BeautifulSoup
url = "https://iapps.courts.state.ny.us/nyscef/DocumentList?docketId=npvulMdOYzFDYIAomW_PLUS_elw==&display=all"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content,'html.parser')
date = soup.find('span', {'class': 'grayItalic'}).get_text().strip()
converted_date = int(date.split("/")[-2])
print(converted_date)
print(date)
import dateutil.parser
from bs4 import BeautifulSoup
html_doc = """<span class="grayItalic">
Received: 01/19/2021
</span>"""
soup=BeautifulSoup(html_doc,'html.parser')
date_ = soup.find('span', {'class': 'grayItalic'}).get_text()
dateutil.parser.parse(date_,fuzzy=True)
Output:
datetime.datetime(2021, 1, 19, 0, 0)
date_ outputs '\n Received: 01/19/2021\n'. You are using string slicing; instead, you can use dateutil.parser, which will return a datetime.datetime object for you.
In this case I've assumed you just need the date. If you need the surrounding text too, you can use fuzzy_with_tokens=True.
If the fuzzy_with_tokens option is True, parse() returns a tuple: the first element is a datetime.datetime object, and the second is a tuple containing the fuzzy tokens.
dateutil.parser.parse(date_,fuzzy_with_tokens=True)
(datetime.datetime(2021, 1, 19, 0, 0), (' Received: ', ' '))
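Since the question asks for the "19" specifically, the day is available as an attribute of the parsed datetime object:
parsed = dateutil.parser.parse(date_, fuzzy=True)
day = parsed.day  # 19, as an int
print(day)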
I couldn't load the URL using the requests or urllib module; I guess the website is blocking automated requests. So I opened the webpage in a browser, saved the source code to a file named page.html, and ran BeautifulSoup on that. That seems to work.
from bs4 import BeautifulSoup

html = open("page.html")
soup = BeautifulSoup(html, 'html.parser')
date_span = soup.find('span', {'class': 'grayItalic'})
if date_span is not None:
    print(str(date_span.text).strip().replace("Received: ", ""))
    # output: 04/25/2019
I tried fetching the source code with the requests library as follows, but it didn't work (probably the webpage is blocking the request). See if it works on your machine.
import requests

url = "..."
headers = {
'Access-Control-Allow-Origin': '*',
'Access-Control-Allow-Methods': 'GET',
'Access-Control-Allow-Headers': 'Content-Type',
'Access-Control-Max-Age': '3600',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
response = requests.get(url, headers=headers)
html = response.content
print(html)
I'm trying to build a simple script to scrape Google's first Search Results Page and export the results in .csv.
I managed to get URLs and Titles, but I cannot retrieve Descriptions.
I have been using the following code:
import urllib
import requests
from bs4 import BeautifulSoup
# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"
query = "pizza recipe"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
    results = []
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        if anchors:
            link = anchors[0]['href']
            title = g.find('h3').text
            desc = g.select('span')
            description = g.find('span', {'class': 'st'}).text
            item = {
                "title": title,
                "link": link,
                "description": description
            }
            results.append(item)

import pandas as pd
df = pd.DataFrame(results)
df.to_excel("Export.xlsx")
I get the following message when I run the code:
description = g.find('span',{'class':'st'}).text
AttributeError: 'NoneType' object has no attribute 'text'
Essentially, the field is empty.
Can somebody please help me with this line so that I can get all the information from the snippet?
It's not within the div with class="r"; it's under the div with class="s".
So change to this for description:
description = g.find_next_sibling("div", class_='s').find('span',{'class':'st'}).text
From the current element, it will find the next sibling div with class="s"; then you can pull out the <span> tag.
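Google's markup is not entirely uniform, so a result block occasionally has no sibling div with class="s"; a guarded variant of the same lookup (a sketch) avoids the AttributeError:
sibling = g.find_next_sibling('div', class_='s')
span = sibling.find('span', {'class': 'st'}) if sibling else None
description = span.text if span else ''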
Try to use the bs4 select_one() or select() methods; they're more flexible and easier to read (see the CSS selectors reference).
Also, you can pass URL params as a dict and let requests build the query string for you, like so:
# instead of this:
query = "pizza recipe"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"
# try to use this:
params = {
'q': 'fus ro dah', # query
'hl': 'en'
}
requests.get('https://google.com/search', params=params)
If you want to write to .csv then you need to use .to_csv() rather than .to_excel()
If you want to get rid of pandas index column, then you can pass index=False, e.g df.to_csv('FILE_NAME', index=False)
Code and example in the online IDE:
import pandas as pd
import requests
from bs4 import BeautifulSoup
headers = {
"User-agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
'q': 'fus ro dah', # query
'hl': 'en'
}
resp = requests.get("https://google.com/search", headers=headers, params=params)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, "html.parser")
    results = []
    for result in soup.select('.tF2Cxc'):
        title = result.select_one('.DKV0Md').text
        link = result.select_one('.yuRUbf a')['href']
        snippet = result.select_one('#rso .lyLwlc').text
        item = {
            "title": title,
            "link": link,
            "description": snippet
        }
        results.append(item)

    df = pd.DataFrame(results)
    df.to_csv("BS4_Export.csv", index=False)
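Google changes these class names fairly often, so the select_one() calls may eventually return None. A defensive variant of the same loop (a sketch) simply skips results where a selector no longer matches:
for result in soup.select('.tF2Cxc'):
    title_tag = result.select_one('.DKV0Md')
    link_tag = result.select_one('.yuRUbf a')
    snippet_tag = result.select_one('#rso .lyLwlc')
    if not (title_tag and link_tag and snippet_tag):
        continue  # markup changed or this block is an ad/special result
    results.append({
        "title": title_tag.text,
        "link": link_tag['href'],
        "description": snippet_tag.text,
    })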
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out which selectors to use, or why they don't work even though they should, since that part is already done for the end user.
Code to integrate:
from serpapi import GoogleSearch
import os
import pandas as pd
params = {
"api_key": os.getenv("API_KEY"),
"engine": "google",
"q": "fus ro dah",
"hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
data = []
for result in results['organic_results']:
    title = result['title']
    link = result['link']
    snippet = result['snippet']
    data.append({
        "title": title,
        "link": link,
        "snippet": snippet
    })
df = pd.DataFrame(data)
df.to_csv("SerpApi_Export.csv", index=False)
P.S. - I wrote a more detailed blog post about how to scrape Google Organic Results.
Disclaimer, I work for SerpApi.
I'm having trouble scraping a website using BeautifulSoup4 and Python 3. I'm using dryscrape to get the HTML, since the page requires JavaScript to be enabled in order to be shown (but as far as I know, JavaScript is never actually used on the page itself).
This is my code:
from bs4 import BeautifulSoup
import dryscrape
productUrl = "https://www.mercadona.es/detall_producte.php?id=32009"
session = dryscrape.Session()
session.visit(productUrl)
response = session.body()
soup = BeautifulSoup(response, "lxml")
container1 = soup.find("div","contenido").find("dl").find_all("dt")
container3 = soup.find("div","contenido").find_all("td")
Now I want to read container3 content, but:
type(container3)
Returns:
bs4.element.ResultSet
which is the same as type(container1), but its length is 0!
So I wanted to see what I was actually getting in container3 before looking for my <td> tags, so I wrote it to a file.
container3 = soup.find("div","contenido")
soup_file.write(container3.prettify())
And, here is the link to that file: https://pastebin.com/xc22fefJ
It gets all messed up just before the table I want to scrape. I can't understand why; looking at the page source in Firefox, everything looks fine.
Here's the updated solution:
# Imports used across the snippets below
import re

import js2py
import requests
from bs4 import BeautifulSoup

url = 'https://www.mercadona.es/detall_producte.php?id=32009'
rh = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Host': 'www.mercadona.es',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
s = requests.session()
r = s.get(url, headers = rh)
The response to this gives you the "Please enable JavaScript to view the page content." message. However, it also contains the hidden form data that the browser would normally send using JavaScript, which can be seen in the network tab of the developer tools:
TS015fc057_id: 3
TS015fc057_cr: a57705c08e49ba7d51954bea1cc9bfce:jlnk:l8MH0eul:1700810263
TS015fc057_76: 0
TS015fc057_86: 0
TS015fc057_md: 1
TS015fc057_rf: 0
TS015fc057_ct: 0
TS015fc057_pd: 0
Of these, the second one (the long string) is generated by javascript. We can use a library like js2py to run the code, which will return the required string to be passed in the request.
soup = BeautifulSoup(r.content, 'lxml')
script = soup.find_all('script')[1].text
js_code = re.search(r'.*(function challenge.*crc;).*', script, re.DOTALL).groups()[0] + '} challenge();'
js_code = js_code.replace('document.forms[0].elements[1].value=', 'return ')
hidden_inputs = soup.find_all('input')
hidden_inputs[1]['value'] = js2py.eval_js(js_code)
fd = {i['name']: i['value'] for i in hidden_inputs}
rh = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Referer': 'https://www.mercadona.es/detall_producte.php?id=32009',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Content-Length': '188',
'Content-Type': 'application/x-www-form-urlencoded',
'Cache-Control': 'max-age=0',
'Host': 'www.mercadona.es',
'Origin': 'https://www.mercadona.es',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
# NOTE: the next one is a POST request, as opposed to the GET request sent before
r = s.post(url, headers = rh, data = fd)
soup = BeautifulSoup(r.content, 'lxml')
And here's the result:
>>> len(soup.find('div', 'contenido').find_all('td'))
70
>>> len(soup.find('div', 'contenido').find('dl').find_all('dt'))
8
EDIT
Apparently, the javascript code needs to be run only once. The resulting data can be used for more than one request, like this:
for i in range(32007, 32011):
    r = s.post(url[:-5] + str(i), headers = rh, data = fd)
    soup = BeautifulSoup(r.content, 'lxml')
    print(soup.find_all('dd')[1].text)
Result:
Manzana y plátano 120 g
Manzana y plátano 720g (6x120) g
Fresa Plátano 120 g
Fresa Plátano 720g (6x120g)