I have written a web scraper according to a youtube vid. It gives me just one container from all 48 containers.
Why isn't my code looping through all the containers automatically? What did I miss here?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.tradera.com/search?itemStatus=Ended&q=iphone+6+-6s+64gb+-plus'
#
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#Container
containers = page_soup.findAll("div",{"class":"item-card-details"})
filename = "ip6.csv"
f = open(filename, "w")
headers = "title, link, price, bids\n"
f.write(headers)
for container in containers:
title = container.div.div.h3["title"]
link = container.div.div.h3.a["href"]
price_container = container.findAll("span",{"class":"item-card-details-price-amount"})
price = price_container[0].text
bid_container = container.findAll("span",{"class":"item-card-details-bids"})
bids = bid_container[0].text
print("title: " + title)
print("link: " + link)
print("price: " + price)
print("bids: " + bids)
f.write(title + "," + link + "," + price + "," + bids + "\n")
f.close
Because the loop is "empty". In python you have to indent the block of code that should run inside the loop, e.g.:
for i in loop:
# do something
In your code:
for container in containers:
title = container.div.div.h3["title"]
link = container.div.div.h3.a["href"]
price_container = container.findAll("span",{"class":"item-card-details-price-amount"})
price = price_container[0].text
bid_container = container.findAll("span",{"class":"item-card-details-bids"})
bids = bid_container[0].text
print("title: " + title)
print("link: " + link)
print("price: " + price)
print("bids: " + bids)
f.write(title + "," + link + "," + price + "," + bids + "\n")
f.close
You asked me what was going on and why I get the correct result. Below the script adjusted for py 3.5. As it appears some error occurs at the print line. I by accident almost fixed your script in your question itself.
As Ilja pointed out there were indentation errors and its correct he mentioned empty list returns... prior to my accidental partial fix. What I missed out in the accidental fix was not bringing in the print statements into the for-loop. So I get one result. Checking the web-page... you want to collect all phone products.
Below script fixes all the issues by having the print-statements inside the for-loop. Thus in your Pycharm standard output you should now hget many blocks of printed products. And fixing your file wire should show similar result in the csv-file.
Py3.5+ is a little bit childish when it comes to print ('title' + title`). IMO... style py2.x should have been kept as it gives more flexibility and reduces RSI by less typing. Anyway, the iteration through this phone web-page should work now like a pyCharm..
repr comment : no you didn't use repr at all and its not needed but....
For print syntax examples check here and for the official python docs here.
In addition, I've added some formatting code for your output-file. It should be now in columns... and readable. Enjoy!
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.tradera.com/search?itemStatus=Ended&q=iphone+6+-6s+64gb+-plus'
#
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#Container
containers = page_soup.findAll("div",{"class":"item-card-details"})
filename = "ip6.csv"
f = open(filename, "w")
headers = "title, link, price, bids\n"
f.write(headers)
l1 = 0
l2 = 0
l3 = 0
# get longest entry per item for string/column-formatting
for container in containers:
title = container.div.div.h3["title"]
t = len(title)
if t > l1:
l1 = t
link = container.div.div.h3.a["href"]
price_container = container.findAll("span",{"class":"item-card-details-price-amount"})
price = price_container[0].text
p = len(price)
if p > l2:
l2 = p
bid_container = container.findAll("span",{"class":"item-card-details-bids"})
bids = bid_container[0].text
b = len(bids)
if b > l3:
l3 = b
for container in containers:
title = container.div.div.h3["title"]
link = container.div.div.h3.a["href"]
price_container = container.findAll("span",{"class":"item-card-details-price-amount"})
price = price_container[0].text
bid_container = container.findAll("span",{"class":"item-card-details-bids"})
bids = bid_container[0].text
# claculate distances between columns
d1 = l1-len(title) + 0
d2 = l2-len(price) + 1
d3 = l3-len(bids) + 1
d4 = 2
print("title : %s-%s %s." % (l1, d1, title))
print("price : %s-%s %s." % (l2, d2, price))
print("bids : %s-%s %s." % (l3, d3, bids))
print("link : %s." % link)
f.write('%s%s, %s%s, %s%s, %s%s\n' % (title, d1* ' ', d2* ' ', price, d3 * ' ', bids, d4 * ' ', link))
f.close
Thank you all for helping me solve this. It was the indentation of the print lines. You are the best!
Related
I'm quite new to web scraping using BeautifuSoup and I'm attempting to loop through multiple pages and write everything into a CSV file.
Here my current code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
urls = []
url = "https://www.newegg.com/Black-Friday-Deals/EventSaleStore/ID-10475/"
for page in range (1,7):
pageurl = 'Page-{1}'.format(url,page)
urls.append(pageurl)
page = urlopen(url)
html = page.read().decode("utf-8")
page.close()
page_soup = soup(html, "html.parser")
containers = page_soup.findAll("div",{"class" : "item-container"})
print(len(containers))
filename = "BlackFridayNewegg.csv"
f = open(filename, "w")
headers = "Product, Previous price, Current price"
f.write(headers)
for container in containers:
rating_container = page_soup.findAll("span", {"class" : "item-rating-num"})
rating = rating_container[0].text.strip()
title_container = container.findAll("a",{"class":"item-title"})
title = title_container[0].text.strip()
prev_price_container = container.findAll("li",{"class" : "price-was"})
prev_price = prev_price_container[0].text
current_price_container = container.findAll("li", {"class" : "price-current"})
current_price = current_price_container[0].text
print("Product Name: " +title)
print("Previous Price: " + prev_price)
print("Current Price: "+current_price)
result = title.replace(",","")+ "," + prev_price +"," +current_price +"\n"
f.write(result)
f.close()
My code is working properly and will display text from multiple pages, but it won't write it all into the file. Any reason as to why this is happening?
f = open(filename, "w") clears the file. You need f = open(filename, "a")
Having an issue with bs4 when reading second value in array within a for loop. Below I will paste the code.
However, when I use line #19, I receive no errors. When I swap it out for the entire array (line #18), It errors out when it attempts to gather the second value. Note that the second value in the array is the same value as line #19.
import requests
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
SmartLiving_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=Smart%20Living&selectedFacets=Brand%7CSmart%20Living&sortBy="
IEL_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=IEL&selectedFacets=Brand%7CIts%20Exciting%20Lighting&sortBy="
TD_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=two%20dogs&selectedFacets=Brand%7CTwo%20Dogs%20Designs&sortBy="
Headers = "Description, URL, Price \n"
text_file = open("HayneedlePrices.csv", "w")
text_file.write(Headers)
text_file.close()
URL_Array = [SmartLiving_IDS, IEL_IDS, TD_IDS]
#URL_Array = [IEL_IDS]
for URL in URL_Array:
print("\n" + "Loading New URL:" "\n" + URL + "\n" + "\n")
uClient = uReq(URL)
page_html = uClient.read()
uClient.close()
soup = soup(page_html, "html.parser")
Containers = soup.findAll("div", {"product-card__container___1U2Sb"})
for Container in Containers:
Title = Container.div.img["alt"]
Product_URL = Container.a["href"]
Price_Container = Container.findAll("div", {"class":"product-card__productInfo___30YSc body no-underline txt-black"})[0].findAll("span", {"style":"font-size:20px"})
Price_Dollars = Price_Container[0].get_text()
Price_Cents = Price_Container[1].get_text()
print("\n" + "#####################################################################################################################################################################################################" + "\n")
# print(" Container: " + "\n" + str(Container))
# print("\n" + "-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------" + "\n")
print(" Description: " + str(Title))
print(" Product URL: " + str(Product_URL))
print(" Price: " + str(Price_Dollars) + str(Price_Cents))
print("\n" + "#####################################################################################################################################################################################################" + "\n")
text_file = open("HayneedlePrices.csv", "a")
text_file.write(str(Title) + ", " + str(Product_URL) + ", " + str(Price_Dollars) + str(Price_Cents) + "\n")
text_file.close()
print("Information gathered and Saved from URL Successfully.")
print("Looking for Next URL..")
print("No Additional URLs to Gather. Process Completed.")
The problem is that you import BeautifulSoup as soup and also define a variable soup = soup(page_html, "html.parser") with the same name!
I refactored your code a bit, let me know if it works as expected!
import csv
import requests
from bs4 import BeautifulSoup
smart_living_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=Smart%20Living&selectedFacets=Brand%7CSmart%20Living&sortBy="
IEL_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=IEL&selectedFacets=Brand%7CIts%20Exciting%20Lighting&sortBy="
TD_IDS = "https://www.hayneedle.com/search/index.cfm?categoryID=&page=1&searchQuery=two%20dogs&selectedFacets=Brand%7CTwo%20Dogs%20Designs&sortBy="
site_URLs = [smart_living_IDS, IEL_IDS, TD_IDS]
sess = requests.Session()
prod_data = []
for curr_URL in site_URLs:
req = sess.get(url=curr_URL)
soup = BeautifulSoup(req.content, "lxml")
containers = soup.find_all("div", {"product-card__container___1U2Sb"})
for curr_container in containers:
prod_title = curr_container.div.img["alt"]
prod_URL = curr_container.a["href"]
price_container = curr_container.find(
"div",
{"class": "product-card__productInfo___30YSc body no-underline txt-black"},
)
dollars_elem = price_container.find("span", {"class": "main-price-dollars"})
cents_elem = dollars_elem.find_next("span")
prod_price = dollars_elem.get_text() + cents_elem.get_text()
prod_price = float(prod_price[1:])
prod_data.append((prod_title, prod_URL, prod_price))
CSV_headers = ("title", "URL", "price")
with open("../out/hayneedle_prices.csv", "w", newline="") as file_out:
writer = csv.writer(file_out)
writer.writerow(CSV_headers)
writer.writerows(prod_data)
I tested it by repeating the current URL list 10 times, it took longer than I was anticipating. There are certainly improvements to be made, I might rewrite it to use lxml in the next few days, and multiprocessing might also be a good option. It all depends on how you're using this, of course :)
I'm new to to programming and trying to learn by building some small side projects. I have this code and it is working but I am having an issue with it formatting correctly in csv when it pulls all the information. It started adding weird spaces after I added price to be pulled as well. if I comment out price and remove it from write it works fine but I can't figure out why I am getting weird spaces when I add it back.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=graphics%20card&bop=And&PageSize=12&order=BESTMATCH"
# Opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
# grabs each products
containers = page_soup.findAll("div",{"class":"item-container"})
filename = "products.csv"
f = open(filename, "w")
headers = "brand, product_name, shipping\n"
f.write(headers)
for container in containers:
brand = container.div.div.a.img["title"]
title_container = container.findAll("a", {"class":"item-title"})
product_name = title_container[0].text
shipping_container = container.findAll("li", {"class":"price-ship"})
shipping = shipping_container[0].text.strip()
price_container = container.findAll("li", {"class":"price-current"})
price = price_container[0].text.strip()
print("brand: " + brand)
print("product_name: " + product_name)
print("Price: " + price)
print("shipping: " + shipping)
f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "," + price + "\n")
f.close()
You can write to a csv file like the way I've showed below. The output it produces should serve the purpose. Check out this documentation to get the clarity.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
my_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=graphics%20card&bop=And&PageSize=12&order=BESTMATCH"
page_html = urlopen(my_url).read()
page_soup = BeautifulSoup(page_html, "lxml")
with open("outputfile.csv","w",newline="") as infile:
writer = csv.writer(infile)
writer.writerow(["brand", "product_name", "shipping", "price"])
for container in page_soup.findAll("div",{"class":"item-container"}):
brand = container.find(class_="item-brand").img.get("title")
product_name = container.find("a", {"class":"item-title"}).get_text(strip=True).replace(",", "|")
shipping = container.find("li", {"class":"price-ship"}).get_text(strip=True)
price = container.find("li", {"class":"price-current"}).get_text(strip=True).replace("|", "")
writer.writerow([brand,product_name,shipping,price])
You're getting the new lines and spam characters because that is the data you're getting back from BS4: it isn't a product of the writing process. This is because you're trying to get all the text in the list item, whilst there's a lot going on in there. Having a look at the page, if you'd rather just get the price, you can concatenate the text of the strong tag within the list with the text of the sup tag, e.g. price = price_container[0].find("strong").text + price_container[0].find("sup").text. That will ensure you're only picking out the data that you need.
A newbie scraper here !
I am currently indulged in a tedious and boring task where I have to copy/paste certain contents from Angel List and save them in excel. I have previously used scrapers to automate such boring tasks but this one is quite tough and I am unable to find a way to automate it. Please find below the website link:
https://angel.co/people/all
Kindly apply filters Location-> USA, and Market-> Online Dating. There will be around 550 results (please note that the URL doesn't change when you apply the filters)
I have successfully scraped the URLs of all the profiles once filters are applied. Therefore, I have an excel file with 550 URLs of these profiles.
Now the next step is to go to individual profiles and scrape certain information. I am looking for these fields currently:
Name
Description Information
Investments
Founder
Advisor
Locations
Markets
What I'm looking for
Now I have tried a lot of solutions but none have worked so far. Import.io, data miner, data scraper tools are not helping me much.
Please suggest is there any VBA code or Python code or any tool that can help me to automate this scraping task?
COMPLETE CODE FOR SOLUTION:
Here is the final code with comments. If someone still has problems, please comment below and I will try to help you out.
from bs4 import BeautifulSoup
import urllib2
import json
import csv
def fetch_page(url):
opener = urllib2.build_opener()
# changing the user agent as the default one is banned
opener.addheaders = [('User-Agent', 'Mozilla/43.0.1')]
return opener.open(url).read()
#Create a CSV File.
f = open('angle_profiles.csv', 'w')
# Row Headers
f.write("URL" + "," + "Name" + "," + "Founder" + "," + "Advisor" + "," + "Employee" + "," + "Board Member" + ","
+ "Customer" + "," + "Locations" + "," + "Markets" + "," + "Investments" + "," + "What_iam_looking_for" + "\n")
# URLs to iterate over has been saved in file: 'profiles_links.csv' . I will extract the URLs individually...
index = 1;
with open("profiles_links.csv") as f2:
for row in map(str.strip,f2):
url = format(row)
print "# Index: ", index
index += 1;
# Check if URL has 404 error. if yes, skip and continue with the rest of URLs.
try:
html = fetch_page(url)
page = urllib2.urlopen(url)
except Exception, e:
print "Error 404 #: " , url
continue
bs = BeautifulSoup(html, "html.parser")
#Extract info from page with these tags..
name = bs.select(".profile-text h1")[0].get_text().strip()
#description = bs.select('div[data-field="bio"]')[0]['data-value']
founder = map(lambda link: link.get_text().strip(), bs.select('.role_founder a'))
advisor = map(lambda link: link.get_text().strip(), bs.select('.role_advisor a'))
employee = map(lambda link: link.get_text().strip(), bs.select('.role_employee a'))
board_member = map(lambda link: link.get_text().strip(), bs.select('.role_board_member a'))
customer = map(lambda link: link.get_text().strip(), bs.select('.role_customer a'))
class_wrapper = bs.body.find('div', attrs={'data-field' : 'tags_interested_locations'})
count = 1
locations = {}
if class_wrapper is not None:
for span in class_wrapper.find_all('span'):
locations[count] = span.text
count +=1
class_wrapper = bs.body.find('div', attrs={'data-field' : 'tags_interested_markets'})
count = 1
markets = {}
if class_wrapper is not None:
for span in class_wrapper.find_all('span'):
markets[count] = span.text
count +=1
what_iam_looking_for = ' '.join(map(lambda p: p.get_text().strip(), bs.select('div.criteria p')))
user_id = bs.select('.profiles-show .profiles-show')[0]['data-user_id']
# investments are loaded using separate request and response is in JSON format
json_data = fetch_page("https://angel.co/startup_roles/investments?user_id=%s" % user_id)
investment_records = json.loads(json_data)
investments = map(lambda x: x['company']['company_name'], investment_records)
# Make sure that every variable is in string
name2 = str(name); founder2 = str(founder); advisor2 = str (advisor); employee2 = str(employee)
board_member2 = str(board_member); customer2 = str(customer); locations2 = str(locations); markets2 = str (markets);
what_iam_looking_for2 = str(what_iam_looking_for); investments2 = str(investments);
# Replace any , found with - so that csv doesn't confuse it as col separator...
name = name2.replace(",", " -")
founder = founder2.replace(",", " -")
advisor = advisor2.replace(",", " -")
employee = employee2.replace(",", " -")
board_member = board_member2.replace(",", " -")
customer = customer2.replace(",", " -")
locations = locations2.replace(",", " -")
markets = markets2.replace(",", " -")
what_iam_looking_for = what_iam_looking_for2.replace(","," -")
investments = investments2.replace(","," -")
# Replace u' with nothing
name = name.replace("u'", "")
founder = founder.replace("u'", "")
advisor = advisor.replace("u'", "")
employee = employee.replace("u'", "")
board_member = board_member.replace("u'", "")
customer = customer.replace("u'", "")
locations = locations.replace("u'", "")
markets = markets.replace("u'", "")
what_iam_looking_for = what_iam_looking_for.replace("u'", "")
investments = investments.replace("u'", "")
# Write the information back to the file... Note \n is used to jump one row ahead...
f.write(url + "," + name + "," + founder + "," + advisor + "," + employee + "," + board_member + ","
+ customer + "," + locations + "," + markets + "," + investments + "," + what_iam_looking_for + "\n")
Feel free to test the above code with any of the following links:
https://angel.co/idg-ventures?utm_source=people
https://angel.co/douglas-feirstein?utm_source=people
https://angel.co/andrew-heckler?utm_source=people
https://angel.co/mvklein?utm_source=people
https://angel.co/rajs1?utm_source=people
HAPPY CODING :)
For my recipe you will need to install BeautifulSoup using pip or easy_install
from bs4 import BeautifulSoup
import urllib2
import json
def fetch_page(url):
opener = urllib2.build_opener()
# changing the user agent as the default one is banned
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
return opener.open(url).read()
html = fetch_page("https://angel.co/davidtisch")
# or load from local file
#html = open('page.html', 'r').read()
bs = BeautifulSoup(html, "html.parser")
name = bs.select(".profile-text h1")[0].get_text().strip()
description = bs.select('div[data-field="bio"]')[0]['data-value']
founder = map(lambda link: link.get_text().strip(), bs.select('.role_founder a'))
advisor = map(lambda link: link.get_text().strip(), bs.select('.role_advisor a'))
locations = map(lambda link: link.get_text().strip(), bs.select('div[data-field="tags_interested_locations"] a'))
markets = map(lambda link: link.get_text().strip(), bs.select('div[data-field="tags_interested_markets"] a'))
what_iam_looking_for = ' '.join(map(lambda p: p.get_text().strip(), bs.select('div.criteria p')))
user_id = bs.select('.profiles-show .profiles-show')[0]['data-user_id']
# investments are loaded using separate request and response is in JSON format
json_data = fetch_page("https://angel.co/startup_roles/investments?user_id=%s" % user_id)
investment_records = json.loads(json_data)
investments = map(lambda x: x['company']['company_name'], investment_records)
Take a look at https://scrapy.org/
It allows write parser very quickly. Here's my example parser for one site alike angel.co: https://gist.github.com/lisitsky/c4aac52edcb7abfd5975be067face1bb
Unfortunately, angel.co is not available for me now. Good point to start:
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy
class BlogSpider(scrapy.Spider):
name = 'blogspider'
start_urls = ['https://angel.co']
def parse(self, response):
# here's selector to extract interesting elements
for title in response.css('h2.entry-title'):
# write down here values you'd like to extract from the element
yield {'title': title.css('a ::text').extract_first()}
# how to find next page
next_page = response.css('div.prev-post > a ::attr(href)').extract_first()
if next_page:
yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
EOF
$ scrapy runspider myspider.py
Enter interesting css-selectors and run spider.
I would like to know how to export my results from crawling into multiple csv files for each different city that I have crawled. Somehow I´m running into walls, do not get a proper way to sort it out.
That is my code:
import requests
from bs4 import BeautifulSoup
import csv
user_agent = {'User-agent': 'Chrome/43.0.2357.124'}
output_file= open("TA.csv", "w", newline='')
RegionIDArray = [187147,187323,186338]
dict = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()
for reg in RegionIDArray:
for page in range(1,700,30):
r = requests.get("https://www.tripadvisor.de/Attractions-c47-g" + str(reg) + "-oa" + str(page) + ".html")
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"class": "element_wrap"})
for item in g_data:
header = item.find_all("div", {"class": "property_title"})
item = (header[0].text.strip())
if item not in already_printed:
already_printed.add(item)
print("POI: " + str(item) + " | " + "Location: " + str(dict[reg]))
writer = csv.writer(output_file)
csv_fields = ['POI', 'Locaton']
if g_data:
writer.writerow([str(item), str(dict[reg])])
My goal would be that I get three sperate CSV files for Paris, Berlin and London instead of getting all the results in one big csv file.
Could you guys help me out? Thanks for your feedback:)
I did some minor modifications to your code. To make files for each locale, I moved the out_file name inside the loop.
Note, that I don't have time now, the very last line is a hack to ignore unicode errors -- it just skips trying to output a line with a non ascii character. Thas isn't good. Maybe someone can fix that part?
import requests
from bs4 import BeautifulSoup
import csv
user_agent = {'User-agent': 'Chrome/43.0.2357.124'}
RegionIDArray = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()
for reg in RegionIDArray:
output_file= open("TA" + str(reg) + ".csv", "w")
for page in range(1,700,30):
r = requests.get("https://www.tripadvisor.de/Attractions-c47-g" + str(reg) + "-oa" + str(page) + ".html")
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"class": "element_wrap"})
for item in g_data:
header = item.find_all("div", {"class": "property_title"})
item = (header[0].text.strip())
if item not in already_printed:
already_printed.add(item)
# print("POI: " + str(item) + " | " + "Location: " + str(RegionIDArray[reg]))
writer = csv.writer(output_file)
csv_fields = ['POI', 'Locaton']
if g_data:
try:
writer.writerow([str(item), str(RegionIDArray[reg])])
except:
pass