Extracting string data from an HTML source

from bs4 import BeautifulSoup
import urllib.request
page = urllib.request.urlopen('https://www.applied.com/categories/bearings/accessories/adapter-sleeves/c/1580?q=%3Arelevance&page=1')
html = page.read()
soup = BeautifulSoup(html)
items = soup.find_all(class_= 'product product--list ')
for i in items[0:1]:
    product_name = i.find(class_="product__name").a.string.strip()
    print(product_name)
    product_url = i.find(class_="product__name").a['href']
    print(product_url)
    price = i.find(itemprop="price").string
    print(price)
Using the code above, I tried to get the price of each product on that page.
But when I ran it, the price variable came out as None.
When I inspect the HTML source in a browser, the price is shown as normal text, just like the text I got for the product_name variable.
Can someone guide me on how to get the price of the products on that page?

The price is loaded by Ajax (https://www.applied.com/getprices) after the page has loaded, which is why it is not in the HTML.
Use https://www.applied.com/getprices to get the price of an item.
You have to send a POST request with the following params to get the price of a product:
{
"productCodes": "100731658",
"page": "PLP",
"productCode": "100731658",
"CSRFToken": "172c7073-742f-4d7d-9c97-358e0d9e631e"
}
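
As a rough sketch of the request described above, the snippet below builds the POST request with the standard library but does not send it. The parameter names and the endpoint come from the answer; sending the body form-encoded is an assumption (the answer does not say whether the endpoint expects form data or JSON), and the CSRFToken shown is just the sample value from above, which presumably has to be replaced by a fresh one taken from the page or session.

```python
import urllib.parse
import urllib.request

PRICE_URL = "https://www.applied.com/getprices"  # endpoint named in the answer


def build_price_request(product_code, csrf_token):
    """Build (but do not send) the POST request for one product's price."""
    payload = {
        "productCodes": product_code,
        "page": "PLP",
        "productCode": product_code,
        "CSRFToken": csrf_token,  # assumption: must be a token valid for your session
    }
    # Assumption: the endpoint accepts a form-encoded body.
    body = urllib.parse.urlencode(payload).encode("utf-8")
    # Passing data= makes urllib issue a POST instead of a GET.
    return urllib.request.Request(PRICE_URL, data=body)


request = build_price_request("100731658", "172c7073-742f-4d7d-9c97-358e0d9e631e")
print(request.get_method())
print(request.data.decode("utf-8"))
```

To actually fetch the price you would pass the request to urllib.request.urlopen(request) and inspect the response; its format is not described in the answer, so check it before parsing.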

Related

Has anyone had success scraping Amazon using BeautifulSoup?

I want to make a web scraper for Amazon.
But it looks like every piece of data comes back as None.
I searched on Google and found many people who have made web scrapers for Amazon.
Please give me some advice on how to solve this NoneType issue.
Here is my code:
import requests
from bs4 import BeautifulSoup
amazon_dir = requests.get("https://www.amazon.es/s?k=docking+station&__mk_es_ES=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=34FO3BVVCJS4V&sprefix=docking%2Caps%2C302&ref=nb_sb_ss_ts-doa-p_1_7")
amazon_soup = BeautifulSoup(amazon_dir.text, "html.parser")
product_table = amazon_soup.find("div", {"class": "sg-col-inner"})
print(product_table)
products = product_table.find("div", {"class": "a-section"})
name = products.find("span", {"class": "a-size-base-plus"})
rating = products.find("span", {"class": "a-icon-alt"})
price = products.find("span", {"class": "a-price-whole"})
print(name, rating, price)
Thank you

Portals may check the User-Agent header to send different HTML to different browsers or devices, and this can sometimes make it harder to find elements on the page.
More often, though, portals check this header to block scripts and bots.
For example, requests sends User-Agent: python-requests/2.26.0.
If I use the User-Agent header from a real browser, or at least the shorter version Mozilla/5.0, then the code works.
There is another problem.
There are almost 70 <div class="sg-col-inner" ...> elements, and find() returns only the first one. You have to use find_all() and then pick the element that holds the product table by index; the code below uses [3].
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0',
}
url = "https://www.amazon.es/s?k=docking+station&__mk_es_ES=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=34FO3BVVCJS4V&sprefix=docking%2Caps%2C302&ref=nb_sb_ss_ts-doa-p_1_7"
response = requests.get(url, headers=headers)
print(response.text[:1000])
print('---')
amazon_soup = BeautifulSoup(response.text, "html.parser")
all_divs = amazon_soup.find_all("div", {"class": "sg-col-inner"})
print('len(all_divs):', len(all_divs))
print('---')
products = all_divs[3].find("div", {"class": "a-section"})
name = products.find("span", {"class": "a-size-base-plus"})
rating = products.find("span", {"class": "a-icon-alt"})
price = products.find("span", {"class": "a-price-whole"})
print('name:', name.text)
print('rating:', rating.text)
print('price:', price.text)
EDIT:
A version that displays all products:
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0',
}
url = "https://www.amazon.es/s?k=docking+station&__mk_es_ES=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=34FO3BVVCJS4V&sprefix=docking%2Caps%2C302&ref=nb_sb_ss_ts-doa-p_1_7"
response = requests.get(url, headers=headers)
#print(response.text[:1000])
#print('---')
soup = BeautifulSoup(response.text, "html.parser")
results = soup.find("div", {"class": "s-main-slot s-result-list s-search-results sg-row"})
all_products = results.find_all("div", {"class": "sg-col-inner"})
print('len(all_products):', len(all_products))
print('---')
for item in all_products:
    name = item.find("span", {"class": "a-size-base-plus"})
    rating = item.find("span", {"class": "a-icon-alt"})
    price = item.find("span", {"class": "a-price-whole"})
    if name:
        print('name:', name.text)
    if rating:
        print('rating:', rating.text)
    if price:
        print('price:', price.text)
    if name or rating or price:
        print('---')
BTW:
From time to time portals change the code and HTML on their servers, so if you find a tutorial, check how old it is. Older tutorials may no longer work because the portal may have changed something.
Many modern pages use JavaScript to add elements, but requests and BeautifulSoup can't run JavaScript. In that case you may need Selenium to control a real web browser, which can run JavaScript.

bs4 findAll not collecting all of the data from the other pages on the website

I'm trying to scrape a real estate website using BeautifulSoup.
I'm trying to get a list of rental prices for London. This works but only for the first page on the website. There are over 150 of them so I'm missing out on a lot of data. I would like to be able to collect all the prices from all the pages. Here is the code I'm using:
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home'
response = requests.get(url)
response.status_code
data = soup(response.content, 'lxml')
prices = []
for line in data.findAll('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}):
    price = str(line).split('>')[2].split(' ')[0].replace('£', '').replace(',','')
    price = int(price)
    prices.append(price)
Any idea why I can't collect the prices from all the pages using this script?
Extra question: is there a way to access the price using soup, i.e. without doing any list/string manipulation? When I call data.find('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}) I get output of the following form: <div class="css-1e28vvi-PriceContainer e2uk8e7" data-testid="listing-price"><p class="css-1o565rw-Text eczcs4p0" size="6">£3,012 pcm</p></div>
Any help would be much appreciated!

You can append &pn=<page number> parameter to the URL to get next pages:
import re
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home&pn="
prices = []
for page in range(1, 3):  # <-- increase number of pages here
    data = soup(requests.get(url + str(page)).content, "lxml")
    for line in data.findAll(
        "div", {"class": "css-1e28vvi-PriceContainer e2uk8e7"}
    ):
        price = line.get_text(strip=True)
        price = int(re.sub(r"[^\d]", "", price))
        prices.append(price)
        print(price)
print("-" * 80)
print(len(prices))
Prints:
...
1993
1993
--------------------------------------------------------------------------------
50
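
The two steps used in this answer (setting the pn page parameter and reducing a price string to an integer) can also be written as small standalone helpers. This is only a sketch: the helper names page_url and parse_price are made up for illustration, and returning None for text without digits is a design assumption, not something the answer specifies.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url, page):
    """Return base_url with its pn query parameter set to the given page."""
    parts = urlsplit(base_url)
    # Drop any existing pn value, keep every other parameter in order.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "pn"]
    query.append(("pn", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def parse_price(text):
    """Turn a string such as '£3,012 pcm' into an int; None if it has no digits."""
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


print(page_url("https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5", 2))
print(parse_price("£3,012 pcm"))
```

With helpers like these, the loop body reduces to requesting page_url(url, page) and calling parse_price(line.get_text(strip=True)) for each price container.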

Scraping website with BS4 // accessing class

I am trying to extract different pieces of information from websites with BeautifulSoup, such as the title of the product and the price.
I do that with different URLs, looping through them with for...in.... Here, I'll just provide a snippet without the loop.
from bs4 import BeautifulSoup
import requests
import csv
url= 'https://www.mediamarkt.ch/fr/product/_lg-oled65gx6la-1991479.html'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
price = soup.find('meta', property="product:price:amount")
title = soup.find("div", {"class": "flix-model-name"})
title2 = soup.find('div', class_="flix-model-name")
title3 = soup.find("div", attrs={"class": "flix-model-name"})
print(price['content'])
print(title)
print(title2)
print(title3)
So from this URL, https://www.mediamarkt.ch/fr/product/_lg-oled65gx6la-1991479.html, I want to extract the product number. The only place I can find it is in the div with class="flix-model-name". However, I am totally unable to reach it. I tried different ways to access it (title, title2 and title3), but the output is always None.
I am a bit of a beginner, so I guess I am probably missing something basic... If so, please pardon me for that.
Any help is welcome! Many thanks in advance!
Just for info: for each URL I thought of appending the data and writing it to a CSV file like this:
for url in urls:
    html_content = requests.get(url).text
    soup = BeautifulSoup(html_content, "lxml")
    row=[]
    try:
        # title = YOUR VERY WELCOMED ANSWER
        prices = soup.find('meta', property="product:price:amount")
        row = (title.text+','+prices['content']+'\n')
        data.append(row)
    except:
        pass
file = open('database.csv','w')
i = 0
while i < (len(data)):
    file.write(data[i])
    i +=1
file.close()
Many thanks in advance for your help!
David

Try the approach below using Python requests: it is simple, straightforward, reliable and fast, and it needs less code. I found the API URL by inspecting the Network tab in the Google Chrome developer tools.
What exactly the script below does:
First it takes the API URL, builds the full URL from 2 dynamic parameters (product and category), and sends a GET request to fetch the data.
After getting the data, the script parses the JSON using json.loads.
Finally, it iterates over the lists of products one by one and prints the details (Brand, Name, Product number and Unit price), which are split into 2 categories, 'box1_ProductToProduct' and 'box2_KategorieTopseller'. In the same way you can add more details by looking at the API response.
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
def scrap_product_details():
    PRODUCT = 'MMCH1991479' #Product number
    CATEGORY = '680942' #Category number
    URL = 'https://www.mediamarkt.ch/rde_server/res/MMCH/recomm/product_detail/sid/WACXyEbIf3khlu6FcHlh1B1?product=' + PRODUCT + '&category=' + CATEGORY # dynamic URL
    response = requests.get(URL,verify = False) #GET request to fetch the data
    result = json.loads(response.text) # Parse JSON data using json.loads
    box1_ProductToProduct = result[0]['box1_ProductToProduct'] # Extracted data from API
    box2_KategorieTopseller = result[1]['box2_KategorieTopseller']
    for item in box1_ProductToProduct: # loop over extracted data
        print('-' * 100)
        print('Brand : ',item['brand'])
        print('Name : ',item['name'])
        print('Net Unit Price : ',item['netUnitPrice'])
        print('Product Number : ',item['product_nr'])
        print('-' * 100)
    for item in box2_KategorieTopseller: # loop over extracted data
        print('-' * 100)
        print('Brand : ',item['brand'])
        print('Name : ',item['name'])
        print('Net Unit Price : ',item['netUnitPrice'])
        print('Product Number : ',item['product_nr'])
        print('-' * 100)

scrap_product_details()
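
The question also builds its CSV rows by joining strings with commas, which breaks as soon as a product name itself contains a comma or a quote. A minimal sketch of the same step with the standard csv module is shown below; the function name write_rows, the column names, and the sample rows are all made up for illustration and are not data from the site.

```python
import csv
import io


def write_rows(rows, handle):
    """Write a header line followed by (title, price) rows to an open text file."""
    writer = csv.writer(handle)
    writer.writerow(["title", "price"])
    writer.writerows(rows)


# Made-up sample rows; the second title contains a comma and a quote on purpose.
rows = [("Example TV A", "100.00"), ('Example TV 65", wall mount', "20.50")]

buffer = io.StringIO()  # stands in for a real file in this sketch
write_rows(rows, buffer)
print(buffer.getvalue())
```

For a real file you would open it with open('database.csv', 'w', newline='', encoding='utf-8') and pass that handle instead of the StringIO buffer; csv.writer then takes care of quoting.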

Problem with For Loop in Python BeautifulSoup web scraping

I'm a beginner with Python & trying to learn with a BeautifulSoup webscraping project.
I'm looking to scrape the record item title, URL of item & purchase date from this URL & export to a CSV.
I made great progress with scraping title & URL but just cannot figure out how to properly code the purchase date info correctly in my for loop (purchase_date variable below).
What's currently happening is the data in the csv file for the purchase date (e.g. p_date title) just displays blank cells with no text.. no error message just no data getting put into csv. Any guidance is much appreciated.
Thank you!!
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
headers = {"Accept-Language": "en-US, en;q=0.5"}
url = "https://www.popsike.com/php/quicksearch.php?searchtext=metal+-signed+-promo+-beatles+-zeppelin+-acetate+-test+-sinatra&sortord=aprice&pagenum=1&incldescr=1&sprice=100&eprice=&endfrom=2020&endthru=2020&bidsfrom=&bidsthru=&layout=&flabel=&fcatno="
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")
title = []
date = []
URL = []
record_div = soup.find_all('div', class_='col-md-7 add-desc-box')
for container in record_div:
    description = container.a.text
    title.append(description)
    link = container.find('a')
    URL.append(link.get('href'))
    purchase_date = container.find('span',class_= 'info-row').text
    date.append(purchase_date)
test_data = pd.DataFrame({
    'record_description': title,
    'link': URL,
    'p_date': date
})
test_data['link'] = test_data['link'].str.replace('../','https://www.popsike.com/',1)
print(test_data)
test_data.to_csv('popaaron.csv')

I suggest changing the parser (the "html5" parser requires the html5lib package to be installed):
soup = BeautifulSoup(results.text, "html5")
And fix the search expression for the purchase date:
purchase_date = container.select('span.date > b')[0].text.strip(' \t\n\r')

Get a <span> value using python web scrape

I am trying to get a product price using BeautifulSoup in Python.
But I keep getting errors, no matter what I try.
[Screenshot of the site I am trying to scrape]
I want to get the 19,90 value.
I have already written code to get all the product names, and now I need their prices.
import requests
from bs4 import BeautifulSoup
url = 'https://www.zattini.com.br/busca?nsCat=Natural&q=amaro&searchTermCapitalized=Amaro&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
price = soup.find('span', itemprop_='price')
print(price)

A less ideal option is to parse out the JSON that contains the prices:
import requests
import json
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.zattini.com.br/busca?nsCat=Natural&q=amaro&searchTermCapitalized=Amaro&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
scripts = [script.text for script in soup.select('script') if 'var freedom = freedom ||' in script.text]
pricesJson = scripts[0].split('"items":')[1].split(']')[0] + ']'
prices = [item['price'] for item in json.loads(pricesJson)]
names = [name.text for name in soup.select('#item-list [itemprop=name]')]
results = list(zip(names,prices))
df = pd.DataFrame(results)
print(df)

span[itemprop='price'] is generated by JavaScript. The original value is stored in div[data-final-price] as a value like 1990, and you can format it as 19,90 with a regex.
import re
...
soup = BeautifulSoup(page.text, 'html.parser')
prices = soup.select('div[data-final-price]')
for price in prices:
    price = re.sub(r'(\d\d$)', r',\1', price['data-final-price'])
    print(price)
Results:
19,90
134,89
29,90
119,90
104,90
59,90
....
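
As an alternative to the regex, the same formatting can be done with integer arithmetic. This is a sketch under the assumption that data-final-price always holds a non-negative whole number of cents (the answer only shows values like 1990); the helper name format_cents is made up. Unlike the regex, it also handles values with fewer than two digits.

```python
def format_cents(value):
    """Format a whole number of cents, e.g. '1990', as '19,90'."""
    cents = int(value)
    # Integer part before the comma, two-digit remainder after it.
    return f"{cents // 100},{cents % 100:02d}"


for raw in ["1990", "13489", "5"]:
    print(format_cents(raw))
```

In the loop above, this would be called as format_cents(price['data-final-price']).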
