I am trying to get data from an online shopping website. My code runs without any error, but the data is not getting extracted to the CSV file like it should. Where am I going wrong with the code?
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
driver = webdriver.Chrome("/usr/bin/chromedriver")
products=[] #List to store name of the product
prices=[] #List to store price of the product
ratings=[] #List to store rating of the product
driver.get("https://www.flipkart.com/lenovo-core-i3-6th-gen-4-gb-1-tb-hdd-windows-10-home-ip-320e-laptop/p/itmf3s32ghxrkrhf?pid=COMEWM7FTAQ9EHRF&srno=b_1_2&otracker=browse&lid=LSTCOMEWM7FTAQ9EHRFBL70ZV&fm=organic&iid=90098c10-e53b-49dc-9359-ff04338c0c4e.COMEWM7FTAQ9EHRF.SEARCH&ssid=2d6xzladk00000001572540087124")
content = driver.page_source
soup = BeautifulSoup(content)
for a in soup.findAll('a', href=True, attrs={'class': '_29OxBi'}):
    name = a.find('div', attrs={'class': '_35KyD6'})
    price = a.find('div', attrs={'class': '_1vC4OE _3qQ9m1'})
    rating = a.find('div', attrs={'class': 'hGSR34'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)
df = pd.DataFrame({'Product Name':products,'Price':prices,'Rating':ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')
I expect the code to return data such as name, price and rating of the products available on the website.
flipkart: the data is loaded dynamically from a script tag when the browser executes the JavaScript in the page. You can regex this info out and parse it with a JSON parser to retrieve the required fields using just requests, without the overhead of selenium.
import requests, re, json
p = re.compile(r'window\.__INITIAL_STATE__ = (.*);')
r = requests.get('https://www.flipkart.com/lenovo-core-i3-6th-gen-4-gb-1-tb-hdd-windows-10-home-ip-320e-laptop/p/itmf3s32ghxrkrhf?pid=COMEWM7FTAQ9EHRF&srno=b_1_2&otracker=browse&lid=LSTCOMEWM7FTAQ9EHRFBL70ZV&fm=organic&iid=90098c10-e53b-49dc-9359-ff04338c0c4e.COMEWM7FTAQ9EHRF.SEARCH&ssid=2d6xzladk00000001572540087124')
data = json.loads(p.findall(r.text)[0])['pageDataV4']['page']['data']['10002'][1]['widget']['data']
##data sections:
# data.keys()
##pricing info:
# data['pricing']['value'].keys()
# data['pricing']['value']['mrp'].keys()
##rating info:
# data['ratingsAndReviews']['value']['rating']
price = data['pricing']['value']['mrp']['currency'] + str(data['pricing']['value']['mrp']['value'])
title = ' '.join(reversed([v for k,v in data['titleComponent']['value'].items() if k in ['title', 'subtitle']]))
average_rating = data['ratingsAndReviews']['value']['rating']['average']
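If you still want the csv output the original code was aiming for, here is a minimal sketch that writes the fields extracted above (the column names simply mirror the ones in the question):
import pandas as pd

# minimal sketch: one row with the fields extracted above;
# a product detail page yields a single product, hence one row
df = pd.DataFrame([{'Product Name': title, 'Price': price, 'Rating': average_rating}])
df.to_csv('products.csv', index=False, encoding='utf-8')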
I scraped a list of professor contact information from a school website, and now I want to save each professor's details to an individual text file named after them, containing their email, telephone number and office.
Currently my code is:
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.cb.cityu.edu.hk/is/people/academic.html'
webpage = requests.get(url)
page = bs(webpage.content, 'html.parser')
# define lists to store the scraped fields
name_list = []
phone_list = []
email_list = []
result = page.find_all('div', attrs = {'class': 'staff-details'})
for person in result:
    print(person.text)
You can use a loop to fetch the data and simultaneously save the data in a text file.
from bs4 import BeautifulSoup as bs
import requests
url = 'https://www.cb.cityu.edu.hk/is/people/academic.html'
webpage = requests.get(url)
page = bs(webpage.content, 'html.parser')
prof_list = page.select(".staff-details")
for i in prof_list:
    name = i.select_one('.name > a').text
    email = i.select_one('.list-info div.value:nth-child(2) > a').text
    phone = i.select_one('.list-info div.value:nth-child(4)').text
    office = i.select_one('.list-info div.value:nth-child(6)').text
    with open(name + '.txt', 'w+') as file:
        file.write("Email:\n")
        file.write(email)
        file.write('\nPhone:\n')
        file.write(phone)
        file.write("\nOffice:\n")
        file.write(office)
Open a file with the name you want using a context manager in w+ mode. Here is a sample for you.
Inside your for loop:
with open(file_name_goes_here, "w+") as f:
    f.write(content_goes_here_as_string)
I am trying to extract different information from websites with BeautifulSoup, such as the title of the product and the price.
I do that for different urls, looping through them with for ... in. Here, I'll just provide a snippet without the loop.
from bs4 import BeautifulSoup
import requests
import csv
url= 'https://www.mediamarkt.ch/fr/product/_lg-oled65gx6la-1991479.html'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
price = soup.find('meta', property="product:price:amount")
title = soup.find("div", {"class": "flix-model-name"})
title2 = soup.find('div', class_="flix-model-name")
title3 = soup.find("div", attrs={"class": "flix-model-name"})
print(price['content'])
print(title)
print(title2)
print(title3)
So from this URL https://www.mediamarkt.ch/fr/product/_lg-oled65gx6la-1991479.html I want to extract the product number. The only place I find it is in the div class="flix-model-name". However, I am totally unable to reach it. I tried different ways to access it with title, title2 and title3, but the output is always None.
I am a bit of a beginner, so I guess I am probably missing something basic... If so, please pardon me for that.
Any help is welcome! Many thanks in advance!
Just for info: for each url, I thought of appending the data and writing it to a CSV file like this:
for url in urls:
    html_content = requests.get(url).text
    soup = BeautifulSoup(html_content, "lxml")
    row = []
    try:
        # title = YOUR VERY WELCOMED ANSWER
        prices = soup.find('meta', property="product:price:amount")
        row = (title.text + ',' + prices['content'] + '\n')
        data.append(row)
    except:
        pass

file = open('database.csv', 'w')
i = 0
while i < len(data):
    file.write(data[i])
    i += 1
file.close()
Many thanks in advance for your help!
David
Try the approach below using python-requests: it is simple, straightforward, reliable and fast, and less code is required. I fetched the API URL from the website itself after inspecting the network section of the Google Chrome browser.
What exactly the script below is doing:
First it takes the API URL, builds the URL from 2 dynamic parameters (product and category), and then makes a GET request to fetch the data.
After getting the data, the script parses the JSON using json.loads.
Finally, it iterates over the list of products one by one and prints the details, which are divided into 2 categories, 'box1_ProductToProduct' and 'box2_KategorieTopseller': brand, name, product number and unit price. In the same way you can add more details by looking into the API call.
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
def scrap_product_details():
    PRODUCT = 'MMCH1991479'  # product number
    CATEGORY = '680942'  # category number
    URL = 'https://www.mediamarkt.ch/rde_server/res/MMCH/recomm/product_detail/sid/WACXyEbIf3khlu6FcHlh1B1?product=' + PRODUCT + '&category=' + CATEGORY  # dynamic URL
    response = requests.get(URL, verify=False)  # GET request to fetch the data
    result = json.loads(response.text)  # parse the JSON data using json.loads
    box1_ProductToProduct = result[0]['box1_ProductToProduct']  # extracted data from the API
    box2_KategorieTopseller = result[1]['box2_KategorieTopseller']
    for item in box1_ProductToProduct:  # loop over the extracted data
        print('-' * 100)
        print('Brand : ', item['brand'])
        print('Name : ', item['name'])
        print('Net Unit Price : ', item['netUnitPrice'])
        print('Product Number : ', item['product_nr'])
        print('-' * 100)
    for item in box2_KategorieTopseller:  # loop over the extracted data
        print('-' * 100)
        print('Brand : ', item['brand'])
        print('Name : ', item['name'])
        print('Net Unit Price : ', item['netUnitPrice'])
        print('Product Number : ', item['product_nr'])
        print('-' * 100)

scrap_product_details()
I'm a beginner with Python & trying to learn with a BeautifulSoup webscraping project.
I'm looking to scrape the record item title, URL of item & purchase date from this URL & export to a CSV.
I made great progress with scraping the title & URL but just cannot figure out how to correctly code the purchase date info in my for loop (the purchase_date variable below).
What's currently happening is that the purchase date column in the csv file (the p_date header) just displays blank cells with no text... no error message, just no data getting put into the csv. Any guidance is much appreciated.
Thank you!!
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
headers = {"Accept-Language": "en-US, en;q=0.5"}
url = "https://www.popsike.com/php/quicksearch.php?searchtext=metal+-signed+-promo+-beatles+-zeppelin+-acetate+-test+-sinatra&sortord=aprice&pagenum=1&incldescr=1&sprice=100&eprice=&endfrom=2020&endthru=2020&bidsfrom=&bidsthru=&layout=&flabel=&fcatno="
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")
title = []
date = []
URL = []
record_div = soup.find_all('div', class_='col-md-7 add-desc-box')
for container in record_div:
    description = container.a.text
    title.append(description)
    link = container.find('a')
    URL.append(link.get('href'))
    purchase_date = container.find('span', class_='info-row').text
    date.append(purchase_date)

test_data = pd.DataFrame({
    'record_description': title,
    'link': URL,
    'p_date': date
})
test_data['link'] = test_data['link'].str.replace('../','https://www.popsike.com/',1)
print(test_data)
test_data.to_csv('popaaron.csv')
I suggest changing the parser type to html5lib (the html5lib package must be installed):
soup = BeautifulSoup(results.text, "html5lib")
And fix search expression for purchase date:
purchase_date = container.select('span.date > b')[0].text.strip(' \t\n\r')
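Put together, the loop body would look roughly like this (a sketch of the question's loop with the fix applied, assuming the same container markup; 'span.date > b' targets the bold date inside each listing):
for container in record_div:
    title.append(container.a.text)
    URL.append(container.find('a').get('href'))
    # the date sits inside a <b> within span.date, so select it directly
    purchase_date = container.select('span.date > b')[0].text.strip(' \t\n\r')
    date.append(purchase_date)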
A complete beginner here... I am trying to scrape the constituents table from this Wikipedia page; however, the table scraped was the annual returns table (1st table) instead of the constituents table (2nd table) that I need. Could someone help me target the specific table I want using BeautifulSoup4?
import bs4 as bs
import pickle
import requests
def save_klci_tickers():
    resp = requests.get('https://en.wikipedia.org/wiki/FTSE_Bursa_Malaysia_KLCI')
    soup = bs.BeautifulSoup(resp.text)
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("klcitickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    print(tickers)
    return tickers

save_klci_tickers()
Try the pandas library to get the tabular data from that page into a csv file in the blink of an eye:
import pandas as pd
url = 'https://en.wikipedia.org/wiki/FTSE_Bursa_Malaysia_KLCI'
df = pd.read_html(url, attrs={"class": "wikitable"})[1] #change the index to get the table you need from that page
new = pd.DataFrame(df, columns=["Constituent Name", "Stock Code", "Sector"])
new.to_csv("wiki_data.csv", index=False)
print(df)
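If hard-coding the index feels brittle, read_html can also pick a table by matching text inside it; a sketch assuming the string "Stock Code" appears only in the constituents table:
# match keeps only the tables whose text matches the given string/regex;
# "Stock Code" is assumed to be unique to the constituents table
df = pd.read_html(url, match="Stock Code")[0]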
If it is still BeautifulSoup you wanna stick with, the following should serve the purpose:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://en.wikipedia.org/wiki/FTSE_Bursa_Malaysia_KLCI")
soup = BeautifulSoup(res.text,"lxml")
for items in soup.select("table.wikitable")[1].select("tr"):
    data = [item.get_text(strip=True) for item in items.select("th,td")]
    print(data)
If you wanna use .find_all() instead of .select(), try the following:
for items in soup.find_all("table", class_="wikitable")[1].find_all("tr"):
    data = [item.get_text(strip=True) for item in items.find_all(["th", "td"])]
    print(data)
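If you also want this BeautifulSoup version to produce a csv like the pandas one, here is a minimal sketch reusing the soup object from above (the file name is illustrative):
import csv

with open("wiki_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for items in soup.select("table.wikitable")[1].select("tr"):
        # one csv row per table row, header row included
        writer.writerow([item.get_text(strip=True) for item in items.select("th,td")])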
I am new to programming and am trying to build my first little web crawler in python.
Goal: Crawling a product list page - scraping brand name, article name, original price and new price - saving in CSV file
Status: I've managed to get the brand name and article name as well as the original price, and to put them in the correct order into a list (e.g. 10 products). As there is a brand name, description and price for every item, my code gets them into the csv in the correct order.
Code:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myUrl = 'https://www.zalando.de/rucksaecke-herren/'
#open connection, grabbing page, saving in page_html and closing connection
uClient = uReq(myUrl)
page_html = uClient.read()
uClient.close()
#Datatype, html paser
page_soup = soup(page_html, "html.parser")
#grabbing information
brand_Names = page_soup.findAll("div",{"class": "z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn"})
articale_Names = page_soup.findAll("div",{"class": "z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn"})
original_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_originalPrice-2Oy4G"})
new_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_promotionalPrice-3GRE7"})
#opening a csv file and printing its header
filename = "XXX.csv"
file = open(filename, "w")
headers = "BRAND, ARTICALE NAME, OLD PRICE, NEW PRICE\n"
file.write(headers)
#How many brands on page?
products_on_page = len(brand_Names)
#Looping through all brands, articles and prices, and writing the text into the CSV
for i in range(products_on_page):
    brand = brand_Names[i].text
    articale_Name = articale_Names[i].text
    price = original_Prices[i].text
    new_Price = new_Prices[i].text
    file.write(brand + "," + articale_Name + "," + price.replace(",", ".") + "," + new_Price.replace(",", ".") + "\n")
#closing CSV
file.close()
Problem: I am struggling to get the discounted prices into my csv in the right place. Not every item has a discount, and I currently see two issues with my code:
I use .findAll to look for the information on the website - as there are fewer discounted products than total products, my new_Prices list contains fewer prices (e.g. 3 prices for 10 products). If I were able to add them to the list, I assume they would show up in the first 3 rows. How can I make sure to add the new_Prices to the right products?
I am getting an "IndexError: list index out of range" error, which I assume is caused by the fact that I am looping through 10 products, but for new_Prices I reach the end quicker than for my other lists. Does that make sense, and is my assumption correct?
I am very much appreciating any help.
Thank,
Thorsten
Since some items don't have a 'div.z-nvg-cognac_promotionalPrice-3GRE7' tag, you can't rely on the list index.
However, you can select all the container tags ('div.z-nvg-cognac_infoContainer-MvytX') and use .find() to select the tags within each item.
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import csv
my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
client = urlopen(my_url)
page_html = client.read().decode(errors='ignore')
page_soup = soup(page_html, "html.parser")
headers = ["BRAND", "ARTICALE NAME", "OLD PRICE", "NEW PRICE"]
filename = "test.csv"
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    items = page_soup.find_all(class_='z-nvg-cognac_infoContainer-MvytX')
    for item in items:
        brand_names = item.find(class_="z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn").text
        articale_names = item.find(class_="z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn").text
        original_prices = item.find(class_="z-nvg-cognac_originalPrice-2Oy4G").text
        new_prices = item.find(class_="z-nvg-cognac_promotionalPrice-3GRE7")
        if new_prices is not None:
            new_prices = new_prices.text
        writer.writerow([brand_names, articale_names, original_prices, new_prices])
If you want to get more than 24 items per page you have to use a client that runs js, like selenium.
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import csv
my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
driver = webdriver.Firefox()
driver.get(my_url)
page_html = driver.page_source
driver.quit()
page_soup = soup(page_html, "html.parser")
...
Footnotes:
The naming convention for functions and variables is lowercase with underscores.
When reading or writing csv files it's best to use the csv lib.
When handling files you can use the with statement.