I'm new to programming and trying to learn by building some small side projects. I have this code and it works, but I'm having an issue with it formatting correctly in CSV when it pulls all the information. It started adding weird spaces after I added price to be pulled as well. If I comment out price and remove it from the write, it works fine, but I can't figure out why I am getting weird spaces when I add it back.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=graphics%20card&bop=And&PageSize=12&order=BESTMATCH"
# Opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
# grabs each product
containers = page_soup.findAll("div",{"class":"item-container"})
filename = "products.csv"
f = open(filename, "w")
headers = "brand, product_name, shipping\n"
f.write(headers)
for container in containers:
    brand = container.div.div.a.img["title"]
    title_container = container.findAll("a", {"class":"item-title"})
    product_name = title_container[0].text
    shipping_container = container.findAll("li", {"class":"price-ship"})
    shipping = shipping_container[0].text.strip()
    price_container = container.findAll("li", {"class":"price-current"})
    price = price_container[0].text.strip()
    print("brand: " + brand)
    print("product_name: " + product_name)
    print("Price: " + price)
    print("shipping: " + shipping)
    f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "," + price + "\n")
f.close()
You can write to a csv file the way I've shown below. The output it produces should serve the purpose. Check out the csv module documentation for more clarity.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
my_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=graphics%20card&bop=And&PageSize=12&order=BESTMATCH"
page_html = urlopen(my_url).read()
page_soup = BeautifulSoup(page_html, "lxml")
with open("outputfile.csv","w",newline="") as infile:
    writer = csv.writer(infile)
    writer.writerow(["brand", "product_name", "shipping", "price"])
    for container in page_soup.findAll("div", {"class":"item-container"}):
        brand = container.find(class_="item-brand").img.get("title")
        product_name = container.find("a", {"class":"item-title"}).get_text(strip=True).replace(",", "|")
        shipping = container.find("li", {"class":"price-ship"}).get_text(strip=True)
        price = container.find("li", {"class":"price-current"}).get_text(strip=True).replace("|", "")
        writer.writerow([brand, product_name, shipping, price])
You're getting the newlines and stray characters because that is the data you're getting back from BS4: it isn't a product of the writing process. This is because you're grabbing all the text in the list item, and there's a lot going on in there. Looking at the page, if you'd rather just get the price, you can concatenate the text of the strong tag within the list item with the text of the sup tag, e.g. price = price_container[0].find("strong").text + price_container[0].find("sup").text. That ensures you only pick out the data you need.
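For instance, a minimal sketch of that approach, using a made-up snippet that mimics the shape of the price markup (the real page's HTML is assumed here):

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking a price list item (assumed structure)
html = '<li class="price-current">$<strong>279</strong><sup>.99</sup> <span>(12 offers)</span></li>'
page_soup = BeautifulSoup(html, "html.parser")

price_container = page_soup.findAll("li", {"class": "price-current"})
# Grabbing all the text drags in the extra pieces of the list item...
messy = price_container[0].text.strip()
# ...while targeting the strong and sup tags yields just the number
price = price_container[0].find("strong").text + price_container[0].find("sup").text
print(price)  # 279.99
```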
Related
I'm building a web scraper. The first row of this data scrape splits the title because of the number "1,000" at the end. How do I stop this from happening?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.topcashback.co.uk/easyjet-holidays/'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("tr")[1:]
filename = "topcashbackEasyJetholidays.csv"
f = open(filename,"w")
headers = "title, rate \n"
f.write(headers)
for container in containers:
    title = container.td.div.span.text
    rate = container.find("span",{"class":"cashback-desc"}).text
    print("title: " + title)
    print("rate: " + rate)
    f.write(title + "," + rate + "," "\n")
f.close()
The easy and ugly way - wrap the title in quotes so the comma in 1,000 won't be treated as a separator in the csv.
f.write('"' + title + '",' + rate + "," "\n") # btw. why the last comma?
# or with f-string
f.write(f'"{title}",{rate}\n')
The fancier way - use csv.writer
I would check out this before trying to reinvent the wheel:
import pandas as pd
my_url = 'https://www.topcashback.co.uk/easyjet-holidays/'
tables = pd.read_html(my_url, encoding='utf-8')
df = tables[0]
df.columns = ['title', 'n/a', 'rate']
df = df[['title', 'rate']]
df.to_csv("topcashbackEasyJetholidays.csv", index=False)
print(df)
Output:
                                   title    rate
0  London Gatwick Departures over £1,000  £50.00
1        Holiday Bookings £1000 and Over  £40.00
2        Holiday Bookings £999 and Under  £25.00
CSV:
title,rate
"London Gatwick Departures over £1,000",£50.00
Holiday Bookings £1000 and Over,£40.00
Holiday Bookings £999 and Under,£25.00
You'll also need to have lxml installed, i.e. pip install lxml
Here's the "fancy way", which I think is clearly the better way to go. I find it to actually be an easier and simpler way to code up the problem:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
my_url = 'https://www.topcashback.co.uk/easyjet-holidays/'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("tr")[1:]
filename = "topcashbackEasyJetholidays.csv"
with open(filename,"w") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "rate"])
    for container in containers:
        title = container.td.div.span.text
        rate = container.find("span",{"class":"cashback-desc"}).text
        print("title: " + title)
        print("rate: " + rate)
        writer.writerow([title, rate])
There are other advantages to using a CSV writer. The code is more readable and the details of the CSV file format are hidden. There are other characters that could cause you problems and the CSV writer will transparently deal with all of them. The CSV writer will only use quotes when it has to, making your CSV file smaller. If you support multiple output formats, the same code can be used to write all of them by just creating different kinds of writers at the start of the writing code.
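To illustrate that last point, a quick sketch of how csv.writer quotes only when it has to (the field values here are made up):

```python
import csv
import io

# csv.writer only quotes fields that need it: embedded commas, quotes, newlines
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["EVGA GTX 1080", "Holiday Bookings £1,000 and Over", 'a "quoted" name'])
print(buf.getvalue())
# EVGA GTX 1080,"Holiday Bookings £1,000 and Over","a ""quoted"" name"
```

The first field has no special characters, so it is written bare; the second is quoted because of its comma; the third has its inner quotes doubled per the CSV convention.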
page_soup.findAll doesn't seem to get all containers. When running len(containers) it shows I have 12 containers, but it's only pulling info from one. Can someone please help? I'm trying to get info for all 12 containers.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.newegg.com/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20card'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("div",{"class":"item-container"})
for container in containers:
    brand = container.img["title"]
    title_container = container.findAll("a",{"class":"item-title"})
    product_name = title_container[0].text
    shipping_container = container.findAll("li",{"class":"price-ship"})
    shipping = shipping_container[0].text.strip()
print ("brand: " + brand)
print ("product_name: " + product_name)
print ("shipping : " + shipping)
Your code looks good and it is getting all 12 containers, but you are printing only the last one. In order to print all of them, move the last three print lines inside the for loop, like this:
for container in containers:
    brand = container.img["title"]
    title_container = container.findAll("a", {"class": "item-title"})
    product_name = title_container[0].text
    shipping_container = container.findAll("li", {"class": "price-ship"})
    shipping = shipping_container[0].text.strip()
    print("brand: " + brand)
    print("product_name: " + product_name)
    print("shipping : " + shipping)
Having an issue with bs4 when reading the second value in an array within a for loop. Below I will paste the code.
However, when I use line #19, I receive no errors. When I swap it out for the entire array (line #18), It errors out when it attempts to gather the second value. Note that the second value in the array is the same value as line #19.
import requests
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
SmartLiving_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=Smart%20Living&selectedFacets=Brand%7CSmart%20Living&sortBy="
IEL_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=IEL&selectedFacets=Brand%7CIts%20Exciting%20Lighting&sortBy="
TD_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=two%20dogs&selectedFacets=Brand%7CTwo%20Dogs%20Designs&sortBy="
Headers = "Description, URL, Price \n"
text_file = open("HayneedlePrices.csv", "w")
text_file.write(Headers)
text_file.close()
URL_Array = [SmartLiving_IDS, IEL_IDS, TD_IDS]
#URL_Array = [IEL_IDS]
for URL in URL_Array:
    print("\n" + "Loading New URL:" "\n" + URL + "\n" + "\n")
    uClient = uReq(URL)
    page_html = uClient.read()
    uClient.close()
    soup = soup(page_html, "html.parser")
    Containers = soup.findAll("div", {"product-card__container___1U2Sb"})
    for Container in Containers:
        Title = Container.div.img["alt"]
        Product_URL = Container.a["href"]
        Price_Container = Container.findAll("div", {"class":"product-card__productInfo___30YSc body no-underline txt-black"})[0].findAll("span", {"style":"font-size:20px"})
        Price_Dollars = Price_Container[0].get_text()
        Price_Cents = Price_Container[1].get_text()
        print("\n" + "#####################################################################################################################################################################################################" + "\n")
        # print(" Container: " + "\n" + str(Container))
        # print("\n" + "-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------" + "\n")
        print(" Description: " + str(Title))
        print(" Product URL: " + str(Product_URL))
        print(" Price: " + str(Price_Dollars) + str(Price_Cents))
        print("\n" + "#####################################################################################################################################################################################################" + "\n")
        text_file = open("HayneedlePrices.csv", "a")
        text_file.write(str(Title) + ", " + str(Product_URL) + ", " + str(Price_Dollars) + str(Price_Cents) + "\n")
        text_file.close()
    print("Information gathered and Saved from URL Successfully.")
    print("Looking for Next URL..")
print("No Additional URLs to Gather. Process Completed.")
The problem is that you import BeautifulSoup as soup and also define a variable soup = soup(page_html, "html.parser") with the same name!
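A stripped-down illustration of what the rebinding does (the HTML here is made up):

```python
from bs4 import BeautifulSoup as soup

page_html = "<p>hello</p>"
print(soup)          # the BeautifulSoup class itself, callable as a constructor
soup = soup(page_html, "html.parser")
print(type(soup))    # now `soup` names a parsed document, not the class,
                     # so a second loop iteration can't call soup(...) as a constructor
```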
I refactored your code a bit, let me know if it works as expected!
import csv
import requests
from bs4 import BeautifulSoup
smart_living_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=Smart%20Living&selectedFacets=Brand%7CSmart%20Living&sortBy="
IEL_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=IEL&selectedFacets=Brand%7CIts%20Exciting%20Lighting&sortBy="
TD_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=two%20dogs&selectedFacets=Brand%7CTwo%20Dogs%20Designs&sortBy="
site_URLs = [smart_living_IDS, IEL_IDS, TD_IDS]
sess = requests.Session()
prod_data = []
for curr_URL in site_URLs:
    req = sess.get(url=curr_URL)
    soup = BeautifulSoup(req.content, "lxml")
    containers = soup.find_all("div", {"product-card__container___1U2Sb"})
    for curr_container in containers:
        prod_title = curr_container.div.img["alt"]
        prod_URL = curr_container.a["href"]
        price_container = curr_container.find(
            "div",
            {"class": "product-card__productInfo___30YSc body no-underline txt-black"},
        )
        dollars_elem = price_container.find("span", {"class": "main-price-dollars"})
        cents_elem = dollars_elem.find_next("span")
        prod_price = dollars_elem.get_text() + cents_elem.get_text()
        prod_price = float(prod_price[1:])
        prod_data.append((prod_title, prod_URL, prod_price))
CSV_headers = ("title", "URL", "price")
with open("../out/hayneedle_prices.csv", "w", newline="") as file_out:
    writer = csv.writer(file_out)
    writer.writerow(CSV_headers)
    writer.writerows(prod_data)
I tested it by repeating the current URL list 10 times, it took longer than I was anticipating. There are certainly improvements to be made, I might rewrite it to use lxml in the next few days, and multiprocessing might also be a good option. It all depends on how you're using this, of course :)
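As a rough sketch of that concurrency idea, using the standard library's thread pool (the fetch function below is a stand-in for a real sess.get call, just to keep the example self-contained and runnable offline):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for `sess.get(url).content`; a real version would do network I/O
    return "<html>%s</html>" % url

site_URLs = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]

# Threads overlap the waiting time on network requests, so the total time is
# roughly the slowest single request instead of the sum of all of them
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, site_URLs))

print(len(pages))  # 3
```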
I have written a web scraper following a YouTube video. It gives me just one container out of all 48 containers.
Why isn't my code looping through all the containers automatically? What did I miss here?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.tradera.com/search?itemStatus=Ended&q=iphone+6+-6s+64gb+-plus'
#
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#Container
containers = page_soup.findAll("div",{"class":"item-card-details"})
filename = "ip6.csv"
f = open(filename, "w")
headers = "title, link, price, bids\n"
f.write(headers)
for container in containers:
title = container.div.div.h3["title"]
link = container.div.div.h3.a["href"]
price_container = container.findAll("span",{"class":"item-card-details-price-amount"})
price = price_container[0].text
bid_container = container.findAll("span",{"class":"item-card-details-bids"})
bids = bid_container[0].text
print("title: " + title)
print("link: " + link)
print("price: " + price)
print("bids: " + bids)
f.write(title + "," + link + "," + price + "," + bids + "\n")
f.close
Because the loop is "empty". In Python you have to indent the block of code that should run inside the loop, e.g.:
for i in loop:
    # do something
In your code:
for container in containers:
title = container.div.div.h3["title"]
link = container.div.div.h3.a["href"]
price_container = container.findAll("span",{"class":"item-card-details-price-amount"})
price = price_container[0].text
bid_container = container.findAll("span",{"class":"item-card-details-bids"})
bids = bid_container[0].text
print("title: " + title)
print("link: " + link)
print("price: " + price)
print("bids: " + bids)
f.write(title + "," + link + "," + price + "," + bids + "\n")
f.close
You asked me what was going on and why I get the correct result. Below is the script adjusted for Python 3.5. As it turns out, the error occurred at the print lines. By accident I almost fixed your script in your question itself.
As Ilja pointed out, there were indentation errors, and he is right about the empty list returns prior to my accidental partial fix. What I missed in that fix was moving the print statements into the for loop, which is why only one result appeared. Looking at the web page, you want to collect all phone products.
The script below fixes all the issues by putting the print statements inside the for loop. In your PyCharm standard output you should now see many blocks of printed products, and fixing the file write gives a similar result in the csv file.
Note that Python 3 requires parentheses for print, e.g. print("title: " + title).
Regarding the repr comment: no, you didn't use repr at all, and it's not needed here.
For print syntax examples check here, and for the official Python docs here.
In addition, I've added some formatting code for your output file. It should now be laid out in readable columns. Enjoy!
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.tradera.com/search?itemStatus=Ended&q=iphone+6+-6s+64gb+-plus'
#
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#Container
containers = page_soup.findAll("div",{"class":"item-card-details"})
filename = "ip6.csv"
f = open(filename, "w")
headers = "title, link, price, bids\n"
f.write(headers)
l1 = 0
l2 = 0
l3 = 0
# get longest entry per item for string/column-formatting
for container in containers:
    title = container.div.div.h3["title"]
    t = len(title)
    if t > l1:
        l1 = t
    link = container.div.div.h3.a["href"]
    price_container = container.findAll("span",{"class":"item-card-details-price-amount"})
    price = price_container[0].text
    p = len(price)
    if p > l2:
        l2 = p
    bid_container = container.findAll("span",{"class":"item-card-details-bids"})
    bids = bid_container[0].text
    b = len(bids)
    if b > l3:
        l3 = b
for container in containers:
    title = container.div.div.h3["title"]
    link = container.div.div.h3.a["href"]
    price_container = container.findAll("span",{"class":"item-card-details-price-amount"})
    price = price_container[0].text
    bid_container = container.findAll("span",{"class":"item-card-details-bids"})
    bids = bid_container[0].text
    # calculate distances between columns
    d1 = l1-len(title) + 0
    d2 = l2-len(price) + 1
    d3 = l3-len(bids) + 1
    d4 = 2
    print("title : %s-%s %s." % (l1, d1, title))
    print("price : %s-%s %s." % (l2, d2, price))
    print("bids  : %s-%s %s." % (l3, d3, bids))
    print("link  : %s." % link)
    f.write('%s%s, %s%s, %s%s, %s%s\n' % (title, d1* ' ', d2* ' ', price, d3 * ' ', bids, d4 * ' ', link))
f.close()
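As an aside, a sketch of the same column alignment using str.ljust, which can replace the manual width bookkeeping (the sample rows below are made up, standing in for the scraped values):

```python
# Sample rows standing in for the scraped (title, price, bids) values
rows = [
    ("iPhone 6 64GB space grey", "1 200 kr", "12"),
    ("iPhone 6 64GB", "950 kr", "7"),
]

# Compute each column's width once, then pad each field with str.ljust
w_title = max(len(r[0]) for r in rows)
w_price = max(len(r[1]) for r in rows)
lines = [title.ljust(w_title) + "  " + price.ljust(w_price) + "  " + bids
         for title, price, bids in rows]
for line in lines:
    print(line)
```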
Thank you all for helping me solve this. It was the indentation of the print lines. You are the best!
I am new to Python and need some help.
See the py script below. It brings back information for one entry, but I want it to bring back all items that come up on that URL, including those on the pages not shown. What needs changing in the script below to do that?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as Soup
my_url = 'https://www.newegg.com/global/uk/Product/ProductList.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&Description=graphics+card&N=-1&isNodeId=1'
uClient=uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = Soup(page_html, 'html.parser')
containers = page_soup.findAll('div',{'class':'item-container'})
for container in containers:
    brand = container.div.div.a.img['title']
    title_container = container.findAll('a',{'class':'item-title'})
    product_name = title_container[0].text
    price_container = container.findAll('li',{'class':'price-current'})
    Price = price_container[0].text.strip()
    print("brand: " + brand)
    print("product_name: " + product_name)
    print("Price: " + Price)