I am trying to scrape a page. I can get it to pull all the data and save it to array objects, but I cannot get my for loop to iterate over every index of the arrays and output those to CSV. It only writes the headers and the first object. I am new to writing code, so any help is appreciated.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.sports-reference.com/cfb/schools/air-force/'
# Open Connection & Grabbing the Page
uClient = uReq(my_url)
#Creating variable to Save the Page
page_html = uClient.read()
#Closing the connection
uClient.close()
#Parse the data to HTML
page_soup = soup(page_html, "html.parser")
#Grab container info from the DOM
containers = page_soup.findAll("div",{"class":"overthrow table_container"})
filename = "airforce.csv"
f = open(filename, "w")
headers = "year, wins, losses, ties, wl, sos\n"
f.write(headers)
for container in containers:
    #Find all years
    year_container = container.findAll("td",{"data-stat":"year_id"})
    year = year_container[0].text
    #Find number of Wins
    wins_container = container.findAll("td",{"data-stat":"wins"})
    wins = wins_container[0].text
    #Find number of Losses
    losses_container = container.findAll("td",{"data-stat":"losses"})
    losses = losses_container[0].text
    #Number of Ties if any
    ties_container = container.findAll("td",{"data-stat":"ties"})
    ties = ties_container[0].text
    #Win-Loss as a percentage
    wl_container = container.findAll("td",{"data-stat":"win_loss_pct"})
    wl = wl_container[0].text
    #Strength of Schedule. Can be +/- w/ 0 being average
    sos_container = container.findAll("td",{"data-stat":"sos"})
    sos = sos_container[0].text
    f.write(year + "," + wins + "," + losses + "," + ties + "," + wl + "," + sos + "\n")
f.close()
You want to find the table (body) and then iterate over the table rows that are not header rows, i.e. all rows that don't have a class.
For writing (and reading) CSV files there is a csv module in the standard library.
import csv
from urllib.request import urlopen
import bs4
def iter_rows(html):
    headers = ['year_id', 'wins', 'losses', 'ties', 'win_loss_pct', 'sos']
    yield headers
    soup = bs4.BeautifulSoup(html, 'html.parser')
    table_body_node = soup.find('table', 'stats_table').tbody
    for row_node in table_body_node('tr'):
        if not row_node.get('class'):
            yield [
                row_node.find('td', {'data-stat': header}).text
                for header in headers
            ]
def main():
    url = 'https://www.sports-reference.com/cfb/schools/air-force/'
    with urlopen(url) as response:
        html = response.read()
    with open('airforce.csv', 'w') as csv_file:
        csv.writer(csv_file).writerows(iter_rows(html))

if __name__ == '__main__':
    main()
Pulling up the HTML source, there is only one container to be put into your containers list, which means your for loop runs only once and always reads index 0.
You should use range() to index into the different td elements that all reside inside that one item in your containers list.
try this
#number of records to iterate over
num = len(containers[0].findAll("td",{"data-stat":"year_id"}))
for i in range(num):
    #Find all years
    year_container = containers[0].findAll("td",{"data-stat":"year_id"})
    year = year_container[i].text
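To extend that to all of the columns, something like this should work (a sketch built on the variable names from your question; it assumes the first container holds the whole table and that f is still open):
table = containers[0]
stats = ["year_id", "wins", "losses", "ties", "win_loss_pct", "sos"]

#one findAll per column, so cells[j][i] is column j of row i
cells = [table.findAll("td", {"data-stat": stat}) for stat in stats]

for i in range(len(cells[0])):
    f.write(",".join(col[i].text for col in cells) + "\n")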
I'm building a web scraper. The top line on this data scrape splits the title because there is the number "1,000" at the end. How do I stop this from happening?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.topcashback.co.uk/easyjet-holidays/'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("tr")[1:]
filename = "topcashbackEasyJetholidays.csv"
f = open(filename,"w")
headers = "title, rate \n"
f.write(headers)
for container in containers:
    title = container.td.div.span.text
    rate = container.find("span",{"class":"cashback-desc"}).text
    print("title: " + title)
    print("rate: " + rate)
    f.write(title + "," + rate + "," "\n")
f.close()
The easy and ugly way - cover the title with quotes so the comma in "1,000" won't be treated as a separator in the CSV.
f.write('"' + title + '",' + rate + "," "\n") # btw. why the last comma?
# or with f-string
f.write(f'"{title}",{rate}\n")
The fancier way - use csv.writer.
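To see why that helps, here is a tiny self-contained example (just a sketch): csv.writer adds the quotes for you exactly when a field needs them.
import csv
import sys

writer = csv.writer(sys.stdout)
writer.writerow(["London Gatwick Departures over £1,000", "£50.00"])
# prints: "London Gatwick Departures over £1,000",£50.00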
I would check out this before trying to reinvent the wheel:
import pandas as pd
my_url = 'https://www.topcashback.co.uk/easyjet-holidays/'
tables = pd.read_html(my_url, encoding='utf-8')
df = tables[0]
df.columns = ['title', 'n/a', 'rate']
df = df[['title', 'rate']]
df.to_csv("topcashbackEasyJetholidays.csv", index=False)
print(df)
Output:
title rate
0 London Gatwick Departures over £1,000 £50.00
1 Holiday Bookings £1000 and Over £40.00
2 Holiday Bookings £999 and Under £25.00
CSV:
title,rate
"London Gatwick Departures over £1,000",£50.00
Holiday Bookings £1000 and Over,£40.00
Holiday Bookings £999 and Under,£25.00
You'll also need to have lxml installed, i.e. pip install lxml
Here's the "fancy way", which I think is clearly the better way to go. I find it to actually be an easier and simpler way to code up the problem:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv
my_url = 'https://www.topcashback.co.uk/easyjet-holidays/'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.findAll("tr")[1:]
filename = "topcashbackEasyJetholidays.csv"
with open(filename,"w") as f:
writer = csv.writer(f)
writer.writerow(["title", "rate"])
for container in containers:
title = container.td.div.span.text
rate = container.find("span",{"class":"cashback-desc"}).text
print("title: " + title)
print("rate: " + rate)
writer.writerow([title, rate])
There are other advantages to using a CSV writer. The code is more readable and the details of the CSV file format are hidden. There are other characters that could cause you problems, and the CSV writer will transparently deal with all of them. It will only use quotes when it has to, keeping your CSV file smaller. And if you support multiple output formats, the same writing code can produce all of them by just creating a different kind of writer at the start.
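For example (a sketch, not part of the original answer), the same writerows() call can emit comma- or tab-separated output just by configuring the writer differently:
import csv
import sys

rows = [["title", "rate"],
        ["London Gatwick Departures over £1,000", "£50.00"]]

csv.writer(sys.stdout).writerows(rows)                  # CSV output
csv.writer(sys.stdout, delimiter="\t").writerows(rows)  # TSV output, same loop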
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=1'
#opening connection, downloading page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parser
page_soup = soup(page_html, "html.parser")
# catch each event
card = page_soup.findAll("div",{"class":"eds-media-card-content__content"})
filename = "Data_Events.csv"
f = open(filename, "w")
headers = "events_name, events_dates, events_location, events_fees\n"
f.write(headers)
for activity in card:
    event_activity = activity.findAll("div",{"class":"eds-event-card__formatted-name--is-clamped"})
    events_name = event_activity[0].text
    event_date = activity.findAll("div",{"class":"eds-text-bs--fixed eds-text-color--grey-600 eds-l-mar-top-1"})
    events_dates = event_date[0].text
    events_location = event_date[1].text
    events_fees = event_date[2].text
    print("events_name: " + events_name)
    print("events_dates: " + events_dates)
    print("events_location: " + events_location)
    print("events_fees: " + events_fees)
    f.write(events_name + "," + events_dates + "," + events_location + "," + events_fees + "\n")
f.close()
Hi, I am still a beginner in Python and I would like to know how I can make this script move on to the next page within the website and scrape that too.
I have tried to do a
for pages in page (1, 49)
    my_url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=1'
Any advice would be appreciated.
import itertools
import requests
from bs4 import BeautifulSoup
def parse_page(url, page):
    params = dict(page=page)
    resp = requests.get(url, params=params)  # will append `?page=#` to the url
    soup = BeautifulSoup(resp.text, 'html.parser')
    ...  # parse data from page

url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events'

for page in itertools.count(start=1):  # don't need to know the total number of pages
    try:
        parse_page(url, page)
    except Exception:
        # `parse_page` was designed for a particular page layout and will
        # fail when there are no more pages to scrape, so we break here
        break
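Relying on an exception to end the loop works, but it can also hide real bugs in your parsing code. A gentler stop condition (a sketch, assuming the event-card class from your question) is to break as soon as a page yields no items:
import itertools

import requests
from bs4 import BeautifulSoup

def count_events(url, page):
    #returns how many event cards the page contains; 0 means we ran out of pages
    resp = requests.get(url, params=dict(page=page))
    soup = BeautifulSoup(resp.text, 'html.parser')
    cards = soup.find_all('div', {'class': 'eds-media-card-content__content'})
    # ...parse and save each card here...
    return len(cards)

url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events'
for page in itertools.count(start=1):
    if count_events(url, page) == 0:
        break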
I have a script that loops through multiple pages of a website and I want to skip over or add a blank space for the item that might not be on certain pages. For example, there are some pages that do not contain a license. When I run into one of those pages I get an attribute error. My script below loops through the first two pages with no problem, but when it hits the third page it stops. How can I fix this? Here is my script:
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json
base_url = "https://open.umn.edu/opentextbooks/"
data = []
n = 50
for i in range(4, n+1):
    response = urlopen(base_url + "BookDetail.aspx?bookId=" + str(i))
    page_html = response.read()
    response.close()
    #html parsing
    page_soup = soup(page_html, "html.parser")
    #grabs info for each textbook
    containers = page_soup.findAll("div",{"class":"LongDescription"})
    author = page_soup.select("p")
    for container in containers:
        item = {}
        item['type'] = "Textbook"
        item['title'] = container.find("div",{"class":"twothird"}).h1.text
        item['author'] = author[3].get_text(separator=', ')
        if item['author'] == " ":
            item['author'] = "University of Minnesota Libraries Publishing"
        item['link'] = "https://open.umn.edu/opentextbooks/BookDetail.aspx?bookId=" + str(i)
        item['source'] = "Open Textbook Library"
        item['base_url'] = "https://open.umn.edu/opentextbooks/"
        item['license'] = container.find("p",{"class":"Badge-Condition"}).a.text
        if item['license'] != container.find("p",{"class":"Badge-Condition"}).a.text:
            item['license'] = ""
        item['license_url'] = container.find("p",{"class":"Badge-Condition"}).a["href"]
        data.append(item)  # add the item to the list

with open("./json/noSubject/otl-loop.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
I figured it out. My main issue was with item['license']. Here is my fix:
if container.find("p",{"class":"Badge-Condition"}).a:
item['license'] = container.find("p",{"class":"Badge-Condition"}).a.text
if container.find("img",{"class":"ctl00_maincontent_imgLicence"}):
item['license'] = ''
if container.find("p",{"class":"Badge-Condition"}).a:
item['license_url'] = container.find("p",{"class":"Badge-Condition"}).a["href"]
if container.find("img",{"class":"ctl00_maincontent_imgLicence"}):
item['license_url'] = ''
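The repeated lookups could also be factored into one place. A possibly cleaner equivalent (just a sketch; the badge_link helper is made up here) falls back to an empty string whenever the link is missing:
def badge_link(container):
    #returns the <a> inside the license badge, or None if either tag is missing
    badge = container.find("p", {"class": "Badge-Condition"})
    return badge.a if badge else None

link = badge_link(container)
item['license'] = link.text if link else ''
item['license_url'] = link["href"] if link else ''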
I am new to programming and am trying to build my first little web crawler in python.
Goal: Crawling a product list page - scraping brand name, article name, original price and new price - saving in CSV file
Status: I've managed to get the brand name and article name as well as the original price, and put them in the correct order into a list (e.g. 10 products). As there is a brand name, description and price for all items, my code gets them in the correct order into the CSV.
Code:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
myUrl = 'https://www.zalando.de/rucksaecke-herren/'
#open connection, grabbing page, saving in page_html and closing connection
uClient = uReq(myUrl)
page_html = uClient.read()
uClient.close()
#Datatype, html parser
page_soup = soup(page_html, "html.parser")
#grabbing information
brand_Names = page_soup.findAll("div",{"class": "z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn"})
articale_Names = page_soup.findAll("div",{"class": "z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn"})
original_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_originalPrice-2Oy4G"})
new_Prices = page_soup.findAll("div",{"class": "z-nvg-cognac_promotionalPrice-3GRE7"})
#opening a csv file and printing its header
filename = "XXX.csv"
file = open(filename, "w")
headers = "BRAND, ARTICALE NAME, OLD PRICE, NEW PRICE\n"
file.write(headers)
#How many brands on page?
products_on_page = len(brand_Names)
#Looping through all brands, articles, prices and writing the text into the CSV
for i in range(products_on_page):
    brand = brand_Names[i].text
    articale_Name = articale_Names[i].text
    price = original_Prices[i].text
    new_Price = new_Prices[i].text
    file.write(brand + "," + articale_Name + "," + price.replace(",",".") + "," + new_Price.replace(",",".") + "\n")
#closing CSV
file.close()
Problem: I am struggling with getting the discounted prices into my csv at the right place. Not every item has a discount and I currently see two issues with my code:
I use .findAll to look for the information on the website - as there are fewer discounted products than total products, my new_Prices list contains fewer prices (e.g. 3 prices for 10 products). If I were able to add them to the list, I assume they would show up in the first 3 rows. How can I make sure to add the new_Prices to the right products?
I am getting "Index Error: list index out of range" Error, which i assume is caused by the fact that i am looping through 10 products, however for new_Prices i am reaching the end quicker then for my other lists? Does that make sense and is that my assumption correct?
I very much appreciate any help.
Thanks,
Thorsten
Since some items don't have a 'div.z-nvg-cognac_promotionalPrice-3GRE7' tag you can't use the list index reliably.
However you can select all the container tags ('div.z-nvg-cognac_infoContainer-MvytX') and use find to select tags on each item.
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import csv
my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
client = urlopen(my_url)
page_html = client.read().decode(errors='ignore')
page_soup = soup(page_html, "html.parser")
headers = ["BRAND", "ARTICALE NAME", "OLD PRICE", "NEW PRICE"]
filename = "test.csv"
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    items = page_soup.find_all(class_='z-nvg-cognac_infoContainer-MvytX')
    for item in items:
        brand_names = item.find(class_="z-nvg-cognac_brandName-2XZRz z-nvg-cognac_textFormat-16QFn").text
        articale_names = item.find(class_="z-nvg-cognac_articleName--arFp z-nvg-cognac_textFormat-16QFn").text
        original_prices = item.find(class_="z-nvg-cognac_originalPrice-2Oy4G").text
        new_prices = item.find(class_="z-nvg-cognac_promotionalPrice-3GRE7")
        if new_prices is not None:
            new_prices = new_prices.text
        writer.writerow([brand_names, articale_names, original_prices, new_prices])
If you want to get more than 24 items per page you have to use a client that runs JavaScript, like Selenium.
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import csv
my_url = 'https://www.zalando.de/sporttaschen-reisetaschen-herren/'
driver = webdriver.Firefox()
driver.get(my_url)
page_html = driver.page_source
driver.quit()
page_soup = soup(page_html, "html.parser")
...
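One caveat (an assumption on my part, depending on how the page loads): page_source is read immediately after get(), so if the items are rendered lazily you may need an explicit wait first, e.g.:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get(my_url)
# wait up to 10s for at least one product container (class name from above)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'z-nvg-cognac_infoContainer-MvytX'))
)
page_html = driver.page_source
driver.quit()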
Footnotes:
The naming convention for functions and variables is lowercase with underscores.
When reading or writing csv files it's best to use the csv lib.
When handling files you can use the with statement.
I've been using Beautiful Soup to iterate through pages, but for whatever reason I can't get the loop to advance beyond the first page. It seems like it should be easy because it's just a text string, but maybe the problem is my structure rather than my text string?
Here's what I have:
import csv
import urllib2
from bs4 import BeautifulSoup
f = open('nhlstats.csv', "w")
groups=['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']
year = ["2016", "2015","2014","2013","2012"]
for yr in year:
for gr in groups:
url = "http://www.espn.com/nhl/statistics/player/_/stat/points/year/"+str(yr)
#www.espn.com/nhl/statistics/player/_/stat/points/year/2014/
page = urllib2.urlopen(url)
soup=BeautifulSoup(page, "html.parser")
pagecount = soup.findAll(attrs= {"class":"page-numbers"})[0].string
pageliteral = int(pagecount[5:])
for i in range(0,pageliteral):
number = int(((i*40) + 1))
URL = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/"+str(yr) + "/count/"+str(number)
page = urllib2.urlopen(url)
soup=BeautifulSoup(page, "html.parser")
for tr in soup.select("#my-players-table tr[class*=player]"):
row =[]
for ob in range(1,15):
player_info = tr('td')[ob].get_text(strip=True)
row.append(player_info)
f.write(str(yr) +","+",".join(row) + "\n")
f.close()
This gets the same first 40 records over and over.
I tried using this solution as an if, and found that doing
prevLink = soup.select('a[rel="nofollow"]')[0]
newurl = "http:" + prevLink.get('href')
did work better, but I'm not sure how to write the loop so that it actually advances? Possibly I'm just tired, but my loop still just goes to the next set of records and gets stuck on that one. Please help me fix my loop.
UPDATE
My formatting was lost in the copy-paste; my actual code looks like:
import csv
import urllib2
from bs4 import BeautifulSoup

f = open('nhlstats.csv', "w")

groups=['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']

year = ["2016", "2015","2014","2013","2012"]

for yr in year:
    for gr in groups:
        url = "http://www.espn.com/nhl/statistics/player/_/stat/points/year/"+str(yr)
        #www.espn.com/nhl/statistics/player/_/stat/points/year/2014/
    page = urllib2.urlopen(url)
    soup=BeautifulSoup(page, "html.parser")
    pagecount = soup.findAll(attrs= {"class":"page-numbers"})[0].string
    pageliteral = int(pagecount[5:])
    for i in range(0,pageliteral):
        number = int(((i*40) + 1))
        URL = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/"+str(yr) + "/count/"+str(number)
        page = urllib2.urlopen(url)
        soup=BeautifulSoup(page, "html.parser")
        for tr in soup.select("#my-players-table tr[class*=player]"):
            row =[]
            for ob in range(1,15):
                player_info = tr('td')[ob].get_text(strip=True)
                row.append(player_info)
            f.write(str(yr) +","+",".join(row) + "\n")
f.close()
Your code indentation was mostly at fault. Also, it would be wise to actually use the CSV library you imported; it will automatically wrap the player names in quotes so any commas inside them don't ruin the CSV structure.
This works by looking for the link to the next page and extracting the starting count, which is then used to build the next page request. If no next page can be found, it moves on to the next group or year. Note, the count is not a page count but a starting entry count.
import csv
import urllib2
from bs4 import BeautifulSoup

groups = ['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']
year = ["2016", "2015", "2014", "2013", "2012"]

with open('nhlstats.csv', "wb") as f_output:
    csv_output = csv.writer(f_output)

    for yr in year:
        for gr in groups:
            start_count = 1
            while True:
                #print "{}, {}, {}".format(yr, gr, start_count)  # show progress
                url = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/{}/count/{}".format(yr, start_count)
                page = urllib2.urlopen(url)
                soup = BeautifulSoup(page, "html.parser")

                for tr in soup.select("#my-players-table tr[class*=player]"):
                    row = [yr]
                    for ob in range(1, 15):
                        player_info = tr('td')[ob].get_text(strip=True)
                        row.append(player_info)
                    csv_output.writerow(row)

                try:
                    start_count = int(soup.find(attrs={"class":"page-numbers"}).find_next('a')['href'].rsplit('/', 1)[1])
                except:
                    break
Using with will also automatically close your file at the end.
This would give you a csv file starting as follows:
2016,"Patrick Kane, RW",CHI,82,46,60,106,17,30,1.29,287,16.0,9,17,20
2016,"Jamie Benn, LW",DAL,82,41,48,89,7,64,1.09,247,16.6,5,17,13
2016,"Sidney Crosby, C",PIT,80,36,49,85,19,42,1.06,248,14.5,9,10,14
2016,"Joe Thornton, C",SJ,82,19,63,82,25,54,1.00,121,15.7,6,8,21
You are changing the URL many times before you are opening it the first time, due to an indentation error. Try this:
for gr in groups:
    url = "...some_url..."
    page = urllib2.urlopen(url)
    ...everything else should be indented....